"
This series addresses issue #3575 by adding 3 ALLOW FILTERING related
metrics to help profile queries:
* number of read request that required filtering
* total number of rows read that required filtering
* number of rows read that required filtering and matched
Tests: unit (release)
"
* 'allow_filtering_metrics_4' of https://github.com/psarna/scylla:
cql3: publish ALLOW FILTERING metrics
cql3: add updating ALLOW FILTERING metrics
cql3: define ALLOW FILTERING metrics
The following metrics are defined for ALLOW FILTERING:
* number of read request that required filtering
* total number of rows read that required filtering
* number of rows read that required filtering and matched
"
This is a series of patches which make it possible for a human to examine
contents of cache or memtables from GDB.
"
* 'tgrabiec/gdb-cache-printers' of github.com:tgrabiec/scylla:
gdb: Add pretty printer for managed_vector
gdb: Add pretty printer for rows
gdb: Add mutation_partition pretty printer
gdb: Add pretty printer for partition_entry
gdb: Add pretty printer for managed_bytes
gdb: Add iteration wrapper for intrusive_set_external_comparator
gdb: Add iteration wrapper for boost intrusive set
A report that C-Ares returned some errors tells the user nothing.
Improve the error message by including the name of the configuration
variable and its value.
Message-Id: <20180705084959.10872-1-avi@scylladb.com>
"
The main idea of this series is to provide a filtering_visitor
as a specialised result_set_builder::visitor implementation
that keeps restriction info and applies it on query results.
Also, since allow_filtering checking is not correct now (e.g. #2025)
on select_statement level, this series tries to fix any issues
related to it.
Still in TODO:
* handling CONTAINS relation in single column restriction filtering
* handling multi-column restrictions - especially EQ, which can be
split into multiple single-column restrictions
* more tests - it's never enough; especially esoteric cases
like filtering queries which also use secondary indexes,
paging tests, etc.
Tests: unit (release)
"
* 'allow_filtering_6' of https://github.com/psarna/scylla:
tests: add allow_filtering tests to cql_query_test
cql3: enable ALLOW FILTERING
service: add filtering_pager
cql3: optimize filtering partition keys and static rows
cql3: add filtering visitor
cql3: move result_set_builder functions to header
cql3: amend need_filtering()
cql3: add single column primary key restrictions getters
cql3: expose single column primary key restrictions
cql3: add needs_filtering to primary key restrictions
cql3: add simpler single_column_restriction::is_satisfied_by
Enables 'ALLOW FILTERING' queries by transfering control
to result_set_builder::filtering_visitor.
Both regular and primary key columns are allowed,
but some things are left unimplemented:
- multi-column restrictions
- CONTAINS queries
Fixes#2025
If any restriction on partition key or static row part fails,
it will be so for every row that belongs to a partition.
Hence, full check of the rest of the rows is skipped.
In order to filter results of an 'ALLOW FILTERING' query,
a visitor that can take optional filter for result_builder
is provided. It defaults to nop_filter, which accepts
all rows.
Previous implementation of need_filtering() was too eager to assume
that index query should be used, whereas sometimes a query should
just be filtered.
"
This series fixes the "LCS data-loss bug" where full scans (and
everything that uses them) would miss some small percentage (> 0.001%)
of the keys. This could easily lead to permanent data-loss as compaction
and decomission both use full scans.
aeffbb673 worked around this bug by disabling the incremental reader
selectors (the class identified as the source of the bug) altogether.
This series fixes the underlying issue and reverts aeffbb673.
The root cause of the bug is that the `incremental_reader_selector` uses
the current read position to poll for new readers using
`sstable_set::incremental_selector::select()`. This means that when the
currently open sstables contain no partitions that would intersect with
some of the yet unselected sstables, those sstables would be ignored.
Solve the problem by not calling `select()` with the current read
position and always pass the `next_position` returned in the previous
call. This means that the traversal of the sstable-set happens at a pace
defined by the sstable-set itself and this guarantees that no sstable
will be jumped over. When asked for new readers the
`incremental_reader_selector` will now iteratively call `select()` using
the `next_position` from the previous `select()` call until it either
receives some new, yet unselected sstables, or `next_position` surpasses
the read position (in which case `select()` will be tried again later).
The `sstable_set::incremental_selector` was not suitable in its present
state to support calling `select()` with the `next_position` from a
previous call as in some cases it could not make progress due to
inclusiveness related ambiguities. So in preparation to the above fix
`sstable_set` was updated to work in terms of ring-position instead of
tokens. Ring-position can express positions in a much more fine-grained
way then token, including positions after/before tokens and keys. This
allows for a clear expression of `next_position` such that calling
`select()` with it guarantees forward progress in the token-space.
Tests: unit(release, debug)
Refs: #3513
"
* 'leveled-missing-keys/v4' of https://github.com/denesb/scylla:
tests/mutation_reader_test: combined_mutation_reader_test: use SEASTAR_THREAD_TEST_CASE
tests/mutation_reader_test: refactor combined_mutation_reader_test
tests/mutation_reader_test: fix reader_selector related tests
Revert "database: stop using incremental selectors"
incremental_reader_selector: don't jump over sstables
mutation_reader: reader_selector: use ring_position instead of token
sstables_set::incremental_selector: use ring_position instead of token
compatible_ring_position: refactor to compatible_ring_position_view
dht::ring_position_view: use token_bound from ring_position
i_partitioner: add free function ring-position tri comparator
mutation_reader_merger::maybe_add_readers(): remove early return
mutation_reader_merger: get rid of _key
We break build_ami.sh since we dropped Ubuntu support, scylla_current_repo
command does not finishes because of less argument ('--target' with no
distribution name, since $TARGET is always blank now).
It need to hardcoded as centos.
Fixes#3577
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705035251.29160-1-syuu@scylladb.com>
When test.py is run with --jenkins flag Boost UTF is asked to generate
an XML file with the test results. This automatically disables the
human-readable output printed to stdout. There is no real reason to do
so and it is actually less confusing when the Boost UTF messages are in
the test output together with Scylla logger messages.
Message-Id: <20180704172913.23462-1-pdziepak@scylladb.com>
Function to reevaluate postponed compaction was called too early for strategies that
don't allow parallel compaction (only leveled strategy (LCS) at this moment).
Such strategies must first have the ongoing compaction deregistered before reevaluating
the postponed ones. Manager uses task list of ongoing compaction to decides if there's
ongoing compaction for a given column family. So compaction could stop making progress
at all *if and only if* we stop flushing new data.
So it could happen that a column family would be left with lots of pending compaction,
leading the user to think all compacting is done, but after reboot, there will be
lots of compaction activity.
We'll both improve method to detect parallel compaction here and also add a call to
reevaluate postponed compaction after compaction is done.
Fixes#3534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180702185327.26615-1-raphaelsc@scylladb.com>
Make combined_mutation_reader_test more interesting:
* Set the levels on the sstables
* Arrange the sstables so that they test for the "jump over sstables"
bug.
* Arrange the sstables so that they test for the "gap between sstables".
While at it also make the code more compact.
Passing the current read position to the
`incremental_selector::select()` can lead to "jumping" through sstables.
This can happen when the currently open sstables have no partition that
intersects with a yet unselected sstable that has an intersecting range
nevertheless, in other words there is a gap in the selected sstables
that this unselected one completely fits into. In this case the
unselected sstable will be completely omitted from the read.
The solution is to not to avoid calling `select()` with a position that
is larger than the `next_position` returned from the previous `select()`
call. Instead, call `select()` repeatedly with the `next_position` from
the previous call, until either at least one new sstable is selected or
the current read position is surpassed. This guarantess that no sstables
will be jumped over. In other words, advance the incremental selector in
a pace defined by itself thus guaranteeing that no sstable will be
jumped over.
sstable_set::incremental selector was migrated to ring position, follow
suit and migrate the reader_selector to use ring_position as well. Above
correctness this also improves efficiency in case of dense tables,
avoiding prematurely selecting sstables that share the token but start
at different keys, altough one could argue that this is a niche case.
Currently `sstable_set::incremental_selector` works in terms of tokens.
Sstables can be selected with tokens and internally the token-space is
partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as
well. This is problematic for severeal reasons.
The sub-range sstables cover from the token-space is defined in terms of
decorated keys. It is even possible that multiple sstables cover
multiple non-overlapping sub-ranges of a single token. The current
system is unable to model this and will at best result in selecting
unnecessary sstables.
The usage of token for providing the next position where the
intersecting sstables change [1] causes further problems. Attempting to
walk over the token-space by repeatedly calling `select()` with the
`next_position` returned from the previous call will quite possibly lead
to an infinite loop as a token cannot express inclusiveness/exclusiveness
and thus the incremental selector will not be able to make progress when
the upper and lower bounds of two neighbouring intervals share the same
token with different inclusiveness e.g. [t1, t2](t2, t3].
To solve these problems update incremental_selector to work in terms of
ring position. This makes it possible to partition the token-space
amoing sstables at decorated key granularity. It also makes it possible
for select() to return a next_position that is guaranteed to make
progress.
partitioned_sstable_set now builds the internal interval map using the
decorated key of the sstables, not just the tokens.
incremental_selector::select() now uses `dht::ring_position_view` as
both the selector and the next_position. ring_position_view can express
positions between keys so it can also include information about
inclusiveness/exclusiveness of the next interval guaranteeing forward
progress.
[1] `sstable_set::incremental_selector::selection::next_position`
"
In the same way that drivers can route requests to a coordinator that
is also a replica of the data used by the request, we can allow
drivers to route requests directly to the shard. This patchset
adds and documents a way for drivers to know which shard a connection
is connected to, and how to perform this routing.
"
* tag 'shard-info-alt/v1' of https://github.com/avikivity/scylla:
doc: documented protocol extension for exposing sharding
transport: expose more information about sharding via the OPTIONS/SUPPORTED messages
dht: add i_partitioner::sharding_ignore_msb()
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.
Found during code-review.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
Use is_debian()/is_ubuntu() to detect target distribution, also install
pystache by path since package name is different between Fedora and
CentOS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703193224.4773-1-syuu@scylladb.com>
Primary key restrictions sometimes require filtering. These functions
return true if ALLOW FILTERING needs to be enabled in order to satisfy
these restrictions.
Currently restriction::is_satisfied_by() accepts only keys and rows
as arguments. In this commit, a version that only takes bytes of data
is provided.
This simpler version applies to single_column_restriction only,
because it compares raw bytes underneath anyway. For other restriction
types, simplified is_satisfied_by is not defined.
compatible_ring_position's sole purpose is to allow creating
boost::icl::interval_map with dht::ring_position as the key and list of
sstables as the value. This function is served equally well if
compatible_ring_position wraps a `dht::ring_position_view` instead of a
`dht::ring_position` with the added benefit of not having to copy the
possibly heavy `dht::decorated_key` around. It also makes it possible
to do lookups with `dht::ring_position_view` which is much more
versatile and allows avoiding copies just to make lookups.
The only downside is that `dht::ring_position_view` requires the
lifetime of the "viewed" object to be taken care of. This is not a
concern however, as so long as an interval is present in the map the
represented sstable is guaranteed to be alive to, as the interval map
participates in the ownership of the stored sstables.
Rename compatible_ring_position to compatible_ring_position_view to
reflect the changes.
While at it upgrade the std::experimental::optional to std::optional.
Currently dht::ring_position_view's dht::token constructor takes the
token bound in the form of a raw `uint8_t`. This allows for passing a
weight of "0" which is illegal as single token does not represent a
single ring position but an interval as arbitrary number of keys can
have the same token. dht::ring_position uses an enum in its dht::token
constructor. Import that same enum into the dht::ring_position_view
scope and take a `token_bound` instead of `uint8_t`.
This is especially important as in later patches the internal weight of
the ring_position_view will be exposed and illegal values can cause all
sorts of problems.
Gentoo Linux was not supported by the node_health_check script
which resulted in the following error message displayed:
"This s a Non-Supported OS, Please Review the Support Matrix"
This patch adds support for Gentoo Linux while adding a TODO note
to add support for authenticated clusters which the script does
not support yet.
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł'
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since, it wasn't covered by the unit
tests the bug remained unnotices until now.
This series fixes the memory usage calculation and adds proper unit
tests.
* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
tests/mutation: properly mark atomic_cells that are collection members
imr::utils::object: expose size overhead
data::cell: expose size overhead of external chunks
atomic_cell: add external chunks and overheads to
external_memory_usage()
tests/mutation: test external_memory_usage()
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect as there can be valid (if rare) queries that can result in the
ranges list to be empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.
Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not it
checks whether the query is for a singulular partitions or a range scan
to decide whether to enable the stateful queries or not. This check
assumed that there is at least one range in _ranges which will not hold
under some circumstances. Add a check for _ranges being empty.
Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
Debug mode is so slow that generating 1000 mutations is too much for it.
High memory use can also confuse the santitizers that track each allocation.
Reduce mutation count from 1000 to 10 in debug mode.
"
Added NIC / Disk existance check, --force-raid mode on
scylla_raid_setup.
"
* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_raid_setup: verify specified disks are unused
dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
dist/common/scripts/scylla_sysconfig_setup: verify NIC existance
Currently only scylla_setup interactive mode verifies selected disks are
unused, on non-interactive mode we get mdadm/mkfs.xfs program error and
python backtrace when disks are busy.
So we should verify disks are unused also on scylla_raid_setup, print
out simpler error message.
scylla_install_pkg is initially written for one-liner-installer, but now
it only used for creating AMI, and it just few lines of code, so it should be
merge into scylla_install_ami script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
`_key` is only used in a single place and this does not warrant storing
it in a member. Also get rid of current_position() which was used to
query `_key`.
"
I found problems on previously submmited patchset 'scylla_setup fixes'
and 'more fixes for scylla_setup', so fixed them and merged into one
patchset.
Also added few more patches.
"
* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
dist/common/scripts/scylla_raid_setup: fix module import
dist/common/scripts/scylla_setup: check disk is used in MDRAID
dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
dist/common/scripts/scylla_setup: use print() instead of logging.error()
dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
dist/common/scripts/scylla_coredump_setup: run os.remove() when deleting directory is symlink
dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
dist/common/scripts/scylla_setup: add a disk to selected list correctly
dist/common/scripts/scylla_setup: fix wrong indent
dist/common/scripts: sync instance type list for detect NIC type to latest one
dist/common/scripts: verify systemd unit existance using 'systemctl cat'
"
If a coordinator sends write requests with ID=X and restarts it may get a reply to
the request after it restarts and sends another request with the same ID (but to
different replicas). This condition will trigger an assert in a coordinator. Drop
the assertion in favor of a warning and initialize handler id in a way to make
this situation less likely.
Fixes: #3153
"
* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
storage_proxy: initialize write response id counter from wall clock value
storage_proxy: drop virtual from signal(gms::inet_address)
storage_proxy: do not assert on getting an unexpected write reply
Initializing write response id to the same value on each reboot may
cause stale id to be taken for active one if node restarts after
sending only a couple of write request and before receiving replies.
On next reboot it will start assigning id's from the same value and
receiving old replies will confuse it. Mitigate this by assigning
initial id to wall clock value in milliseconds. It will not solve the
problem completely, but will mitigate it.
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.
Fixes#3557.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
In theory we should not get write reply from a node we did not send
write to, but in practice stale reply can be received if node reboot
between sending write and getting a reply. Do not assert, but log the
warning instead and ignore the reply.
Fixes: #3153
Introduced in 5b59df3761.
It is incorrect to erase entries from the memtable being moved to
cache if partition update can be preempted because a later memtable
read may create a snapshot in the memtable before memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for presence of snapshots when entering the partition, but
this condition may change if update is preempted. The fix is to not
allow erasing if update is preemptible.
This also caused SIGSEGVs because we were assuming that no such
snapshots will be created and hence were not invalidating iterators on
removal of the entries, which results in undefined behavior when such
snapshots are actually created.
Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test
Fixes#3532
Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.
The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".
When the last user of a snapshot goes away the snapshot is slided to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemtable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker woken up, and merging will continue
asynchronously.
When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"
* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
tests: perf_row_cache_update: Test with an active reader surviving memtable flush
memtable, cache: Run mutation_cleaner worker in its own scheduling group
mutation_cleaner: Make merge() redirect old instance to the new one
mvcc: Use RAII to ensure that partition versions are merged
mvcc: Merge partition version versions gradually in the background
mutation_partition: Make merging preemtable
tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
Tests testing different aspects of `foreign_reader` and
`multishard_combining_reader` are designed to run with a certain minimum
shard count. Running them with any shard count below this minimum makes
them useless at best but can even fail them.
Refuse to run these tests when the shard count is below the required
minimum to avoid an accidental and unnecessary investigation into a
false-positive test failure.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d24159415b6a9d74eafb8355b6e3fba98c1ff7ff.1530274392.git.bdenes@scylladb.com>
The index_reader class public interface has been amended to only deal
with the upper bound cursor along with advancing the lower bound.
Since the class users can only explicitly operate with the lower bound
cursor (take data file position, advance to the next partition, etc), it
no longer makes sense to specify that the method operates on the lower
bound cursor in its name.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The only case when it needs to be called is when an index_reader is
advanced to a specific partition as part of sstable_reader
initialisation.
Instead, we're passing an optional upper_bound parameter that is used to
call advance_upper_past() internally if partition is found.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
When frozen mutation gets deserialised current implementation copies
its value 3 times: from IDL buffer to bytes object, from bytes object to
atomic_cell and then atomic_cell is copied again. Moreover, the value
gets linearised which may cause a large allocation.
All of that is very wasteful. This patch devirtualises and reworks IDL
reading code so that when used with partition_builder the cell value is
copied only once and without linearisation: from the IDL buffer to the
final atomic_cell.
perf_simple_query -c4, medians of 30 results:
./perf_before ./perf_after diff
read 310576.54 316273.90 1.8%
write 359913.15 375579.44 4.4%
microbenchmark, perf_idl:
BEFORE
test iterations median mad min max
frozen_mutation.freeze_one_small_row 2142435 462.431ns 0.125ns 462.306ns 467.659ns
frozen_mutation.unfreeze_one_small_row 1640949 601.422ns 0.082ns 601.340ns 605.279ns
frozen_mutation.apply_one_small_row 1538969 645.993ns 0.405ns 645.588ns 656.510ns
AFTER
test iterations median mad min max
frozen_mutation.freeze_one_small_row 2139548 455.525ns 0.631ns 454.894ns 456.707ns
frozen_mutation.unfreeze_one_small_row 1760139 566.157ns 0.003ns 566.153ns 584.339ns
frozen_mutation.apply_one_small_row 1582050 610.951ns 0.060ns 610.891ns 613.044ns
Tests: unit(release)
"
* tag 'avoid-copy-unfreeze/v2' of https://github.com/pdziepak/scylla:
mutation_partition_view: use column_mapping_entry::is_atomic()
schema: column_mapping_entry: cache abstract_type::is_atomic()
schema: column_mapping_entry: reduce logic duplication
mutation_partition_view: do not linearise or copy cell value
atomic_cell: allow passing value via ser::buffer_view
mutation_partition_view: pass cell by value to visitor
mutation_partition_view: devirtualise accept()
storage_proxy: use mutation_partition_view::{first, last}_row_key()
mutation_partition_view: add last_row_key() and first_row_key() getters
IDL deserialisation code calls is_atomic() for each cell. An additional
indirection and a virtual call can be avoided by caching that value in
column_mapping_entry. There is already very similar optimisation done
for column_definitions.
User-defined constructors often make it more likely that a careless
developer will forget to update one of them when adding a new member to
a structure. The risk of that happening can be reduced by reducing code
duplication with delegating constructors.
mutation_partition_view needs to create an atomic_cell from
IDL-serialised data. Then that cell is passed to the visitor. However,
because generic mutation_partition_visitor interface was used, the cell
was passed by constant reference which forced the visitor to needlessly
copy it.
This patch takes advantage of the fact that mutation_partition_view is
devirtualised now and adjust the interfaces of its visitors so that the
cell can be passed without copying.
There are only two types of visitors used and only one of them appears
in the hot path. They can be devirtualised without too much effort,
which also enables future custom interface specialisations specific to
mutation_partition_views and its users, not necessairly in the scope of
more general mutation_partition_visitor.
Some users (e.g. reconciliation code) need only to know the clustering
key of the first or the last row in the partition. This was done with a
full visitor visiting every single cell of the partition, which is very
wasteful. This patch adds direct getters for the needed information.
It is only being used by index_reader internally and never exposed so
should not be listed in commonly used types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
An index entry may or may not have a promoted index. All the optional
fields are better scoped under the same class to avoid lots of separate
optional fields and give better representation.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
In the removenode operation, if the message servicing is stopped, e.g., due
to disk io error isolation, the node can keep retrying the
REPLICATION_FINISHED verb infinitely.
Scylla log full of such message was observed:
[shard 0] storage_service - Fail to send REPLICATION_FINISHED to $IP:0:
seastar::rpc::closed_error (connection is closed)
To fix, limit the number of retires.
Tests: update_cluster_layout_tests.py
Fixes#3542
Message-Id: <638d392d6b39cc2dd2b175d7f000e7fb1d474f87.1529927816.git.asias@scylladb.com>
Currently, enabling scylla-fstrim.timer is part of 'enable-service', it
will be enabled even --no-fstrim-setup specified (or input 'No' on interactive setup prompt).
To apply --no-fstrim-setup we need to enabling scylla-fstrim.timer in
scylla_fstrim_setup instead of enable-service part of scylla_setup.
Fixes#3248
On current implementation, we are checking the partition is mounted, but
a disk contains the partition marked as unused.
To avoid the problem, we should skip a disk which contains partitions.
Fixes#3545
When a disk path typed on the RAID setup prompt, the script mistakenly
splits the input for each character,
like ['/', 'd', 'e', 'v', '/', 's', 'd', 'b'].
To fix the issue we need to use selected.append() instead of
selected +=.
See #3545
The presence of a plain reference prohibits the bound_view class from
being copyable. The trick employed to work around that was to use
'placement new' for copy-assigning bound_view objects, but this approach
is ill-formed and causes undefined behaviour for classes that have const
and/or reference members.
The solution is to use a std::reference_wrapper instead.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a0c951649c7aef2f66612fc006c44f8a33713931.1530113273.git.vladimir@scylladb.com>
"
We need a multishard_writer which gets mutation fragments from a producer
(e.g., from the network using the rpc streaming) and consumes the mutation
fragments with a consumer (e.g., write to sstable).
The multishard_writer will take care of the mutation fragments do not belong to
current shard.
This multishard_writer will be used in the new scylla streaming.
"
* 'asias/multishard_writer_v10.1' of github.com:scylladb/seastar-dev:
tests: Add multishard_writer_test to test.py
tests: Add test for multishard_writer
multishard_writer: Introduce multishard_writer
tests: Allow random_mutation_generator to generate mutations belong to remote shrard
The multishard_writer class gets mutation_fragments generated from
flat_mutation_reader and consumes the mutation_fragments with
multishard_writer::_consumer. If the mutation_fragment does not belong to the
shard multishard_writer is on, it will forward the mutation_fragment to the
correct shard. Future returned by multishard_writer() becomes ready
when all the mutation_fragments are consumed.
Tests: tests/multishard_writer_test.cc
Tests: dtest update_cluster_layout_tests.py
Fixes#3497
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".
We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
If memtable snapshot goes away after memtable started merging to
cache, it would enqueue the snapshots for cleaning on the memtable's
cleaner, which will have to clean without deferrring when the memtable
is destroyed. That may stall the reactor. To avoid this, make merge()
cause the old instance of the cleaner to redirect to the new instance
(owned by cache), like we do for regions. This way the snapshots
mentioned earlier can be cleaned after memtable is destroyed,
gracefully.
Before this patch, maybe_merge_versions() had to be manually called
before partition snapshot goes away. That is error prone and makes
client code more complicated. Delegate that task to a new
partition_snapshot_ptr object, through which all snapshots are
published now.
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost. Until node restart or eviction of the
partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes#3552Fixes#3553
"
* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
tests: Add test for slicing a mutation source with date tiered compaction strategy
tests: Check that database conforms to mutation source
database: Disable sstable filtering based on min/max clustering key components
Fixes#3546
Both older origin and scylla writes "known" compressor names (i.e. those
in origin namespace) unqualified (i.e. LZ4Compressor).
This behaviour was not preserved in the virtualization change. But
probably should be.
Message-Id: <20180627110930.1619-1-calle@scylladb.com>
When snapshots go away, typically when the last reader is destroyed,
we used to merge adjacent versions atomically. This could induce
reactor stalls if partitions were large. This is especially true for
versions created on cache update from memtables.
The solution is to allow this process to be preempted and move to the
background. mutation_cleaner keeps a linked list of such unmerged
snapshots and has a worker fiber which merges them incrementally and
asynchronously with regards to reads.
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from 23ms to 2ms.
drop_column_family now waits for both writes and reads in progress.
It solves possible liveness issues with row cache, when column_family
could be dropped prematurely, before the read request was finished.
Phaser operation is passed inside database::query() call.
There are other places where reading logic is applied (e.g. view
replicas), but these are guarded with different synchronization
mechanisms, while _pending_reads_phaser applies to regular reads only.
Fixes#3357
Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <d58a5ee10596d0d62c765ee2114ac171b6f087d2.1529928323.git.sarna@scylladb.com>
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost. Until node restart or eviction of the
partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes#3552Fixes#3553
"
Cache tracker is a thread-local global object that indirectly depends on
the lifetimes of other objects. In particular, a member of
cache_tracker: mutation_cleaner may extend the lifetime of a
mutation_partition until the cleaner is destroyed. The
mutation_partition itself depends on LSA migrators which are
thread-local objects. Since, there is no direct dependency between
LSA-migrators and cache_tracker it is not guarantee that the former
won't be destroyed before the latter. The easiest (barring some unit
tests that repeat the same code several billion times) solution is to
stop using globals.
This series also improves the part of LSA sanitiser that deals with
migrators.
Fixes#3526.
Tests: unit(release)
"
* tag 'deglobalise-cache-tracker/v1-rebased' of https://github.com/pdziepak/scylla:
mutation_cleaner: add disclaimer about mutation_partition lifetime
lsa: enhance sanitizer for migrators
lsa: formalise migrator id requirements
row_cache: deglobalise row cache tracker
This works around a problem of std::terminate() being called in debug
mode build if initialization of _current throws.
Backtrace:
Thread 2 "row_cache_test_" received signal SIGABRT, Aborted.
0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
#1 0x00007ffff17d077d in abort () from /lib64/libc.so.6
#2 0x00007ffff5773025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007ffff5770c16 in ?? () from /lib64/libstdc++.so.6
#4 0x00007ffff576fb19 in ?? () from /lib64/libstdc++.so.6
#5 0x00007ffff5770508 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6 0x00007ffff3ce4ee3 in ?? () from /lib64/libgcc_s.so.1
#7 0x00007ffff3ce570e in _Unwind_Resume () from /lib64/libgcc_s.so.1
#8 0x0000000003633602 in reader::reader (this=0x60e0001160c0, r=...) at flat_mutation_reader.cc:214
#9 0x0000000003655864 in std::make_unique<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (__args#0=...)
at /usr/include/c++/7/bits/unique_ptr.h:825
#10 0x0000000003649a63 in make_flat_mutation_reader<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (args#0=...)
at flat_mutation_reader.hh:440
#11 0x000000000363565d in make_forwardable (m=...) at flat_mutation_reader.cc:270
#12 0x000000000303f962 in memtable::make_flat_reader (this=0x61300001d540, s=..., range=..., slice=..., pc=..., trace_state_ptr=..., fwd=..., fwd_mr=...)
at memtable.cc:592
Message-Id: <1528792447-13336-1-git-send-email-tgrabiec@scylladb.com>
"
The read path on coordinator involves a lot of passing around buffers
and some occasional processing. We start with query::result obtained
from the storage_proxy which is then transformed into a
cql3::result_set, which is then used to write a response. Buffers are
copied and linearised quite excessively.
This series attempts to remedy that by using view of fragmented buffers
as much as possible. The first part deals with reading from
query::result. ser::buffer_view is introduced which enables the IDL
infrastructure to read a buffer without copying or linearising it.
The second part is switching native protocol layer to use bytes_ostream
instead of std::vector<char> to hold the generated response to the
client. The last part introduces cql3::result_generator which is an
alternative to cql3::result_set that passes buffer views without copying
or linearising anything from query::result to the native protocl layer
(or Thrift). It is only used in simple cases, when no processing at the
CQL layer is required, except for paged queries which require some
simple interpretation of the results and are supported by the result
generator.
Tests: unit(release), dtests(paging_test.py paging_additional_test.py
cql_additional_tests.py cql_tracing_test.py cql_prepared_test.py
cql_cast_test.py cql_tests.py)
"
* tag 'buffer-views-query-result/v2' of https://github.com/pdziepak/scylla: (34 commits)
cql3: select_statement: use fetch_page_generator() if possible
pager: add fetch_page_generator()
pager: make the visitor handle_result() accepts a template parameter
pager: make query_result_visitor base class a template parameter
pager: make myvistor a member class of query_pager
pager: make shared pointers to selection constant
pager: merge query_pager and query_pagers::impl
cql3: select_statement: use result_generator if possible
cql3: selection: add is_trivial()
cql3: result: support result_generator
cql3: add lazy result_generator
cql3: add result class
cql3::result_set: fix encapsulation
thrift: use cql3::result_set visiting interface
transport: use cql3::result_set visiting interface
cql3::result_set: add visit()
transport: response: add write_int_placeholder()
transport: steal response buffers and make send zero-copy
transport: use reusable_buffer for compression
transport: response: use bytes_ostream
...
mutation_cleaner has already caused problems by extending lifetime of
mutation_partition past the lifetime of LSA migrators that it uses (due
to the fact that both the cleaner and migrators where thread-local
globals). Since, the long term goal is to make mutation_partition
internal representation depend more and more on schema that lifetime
extension may again cause problems in the future, so let's add a
disclaimer that hopefuly, will help avoiding them.
Current LSA sanitizer performs only basic checks on the migrators use,
without doing any additonal reporting in case an error is detected. This
patch enhances it so that when a problem is detected relevant stack
traces get printed.
object_descriptor uses special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
Row cache tracker has numerous implicit dependencies on ohter objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.
Let's deglobalise cache tracker and put in in the database class.
So far query_result_visitor was tied to result_set_builder. The goal is
to enable result_generator to work with paged queries as well so we need
to decouple them.
Shared pointers make code harder to reason about, it is not easy to get
rid of them in this piece of the code, but we can restore at least a bit
of sanity by adding consts.
There is just a single implementation of query_pager and there is no
reason to make anything virtual. Devirtualising this code will allow
higher layers to pass visitors via templates.
cql3::result can now hold either a result_set or a result_generator.
Some code that is not performance critical expects to get result_set so
a way of converting the result_generator to a result_set is added.
result_generator is a restricted alternative of result_set. It supports
only the simples cases, but is much cheaper as it passes data almost
directly from query::result to its visitor bypassing much of the CQL
layer.
So far the only way of returing a result of a CQL query was to build a
result_set. An alternative lazy result generator is going to be
introduced for the simple cases when no transformations at CQL layer are
needed. To do that we need to hide the fact that there are going to be
multiple representations of a cql results from the users.
This visiting interface for result_set satisfies most of its users (at
least all of those which are in the hot path). It will allow having an
alternative of result_set (i.e. lazy result generator) which would
provide exaclty the same interface.
This allows the response writer to defer writing integers until later
time. It will be used by lazy response generator which will know the
number of rows in the response only after they are all written.
Compression algorithms require us to linearise bytes_ostream. This may
cause an excessive number of large allocations. Using reusable_buffers
can avoid that.
std::vector<char> is not a very good container for incrementally
building a response. It may cause excessive copies and allocations. If
the response is large it will put more pressure on the memory allocator
by requiring the buffer to be contiguous.
We already have bytes_ostream which avoids all of these problems, so
let's use it.
So far cql_server::response was passed around using shared pointers.
They have very big cost of making it hard to reason about the code. All
that is not necessary and we can easily switch to using much more
sensible std::unique_ptr.
There are some other translation units which right now are satisfied
with the response being an incomplete type. This means that
std::unique_ptr can't be used for it. Let's move the class declaration
to a header that can be included where needed.
This commit adds a helper class reusable_buffer which can be used to
avoid excessive memory allocations of large buffers when bytes_ostream
needs to be linearised. The idea is that reusable_buffer in most cases
is going to be thread local so that multiple continuation chains can
reuse the same large buffer.
query::result_view already operates on views of a serialised
query::result. However, until now the value of a cell was always
linearised and copied. This patch makes use of ser::buffer_view to avoid
that.
ser::buffer_view is a view of a fragmented buffer in a stream od
IDL-serialised data. It can be used to deserialise IDL objects without
needless copying and linearisation of large blobs.
"
Converted all setup scripts from bash to python3.
"
* 'scripts_python_conversion_v1' of https://github.com/syuu1228/scylla:
dist/common/scripts: convert scylla_kernel_check to python3
dist/common/scripts: convert scylla_ec2_check to python3
dist/common/scripts: convert scylla_sysconfig_setup to python3
dist/common/scripts: convert scylla_setup to python3
dist/common/scripts: convert scylla_selinux_setup to python3
dist/common/scripts: convert scylla_raid_setup to python3
dist/common/scripts: convert scylla_ntp_setup to python3
dist/common/scripts: convert scylla_fstrim_setup to python3
dist/common/scripts: convert scylla_dev_mode_setup to python3
dist/common/scripts: convert scylla_cpuset_setup to python3
dist/common/scripts: convert scylla_cpuscaling_setup to python3
dist/common/scripts: convert scylla_coredump_setup to python3
dist/common/scripts: convert scylla_bootparam_setup to python3
dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".
An alias is kept to avoid huge code churn.
To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.
Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
This patchset brings support for writing range tombstones to SSTables
3.x. ('mc' format).
In SSTables 3.x, range tombstones are represented by so-called range
tombstone markers (hereafter RT markers) that denote range tombstone
start and end bounds. So each range tombstone is represented in data
file by two ordered RT markers.
There are also markers that both close the previous range tombstone and
open the new one in case if two range tombstones are ajdacent. This is
done to consume less disk space on such occasions.
Range tombstones written as RT markers are naturally non-overlapping.
* github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6
range_tombstone_stream: Remove an unused boolean flag.
Revert "Add missing enum values to bound_kind."
sstables: Move to_deletion_time helper up and make it static.
sstables: Write end-of-partition byte before flushing the last index
block.
sstables: Add support for writing range tombstones in SSTables 3.x
format.
tests: Add unit test covering simple range tombstone.
tests: Add unit test covering adjacent range tombstones.
tests: Add test to cover non-adjacent RTs.
tests: Add test covering mixed rows and range tombstones.
tests: Add test covering SSTables 3.x with many RTs.
tests: Add unit test covering overlapping RTs and rows.
tests: Add tests writing a range tombstone and a row overlapping with
its start.
tests: Add tests writing a range tombstone and a row overlapping with
its end.
tests: Add function that writes from multiple memtable into SSTables.
tests: Add test where 2nd range tombstone covers the remainder of the
1st one.
tests: Add test writing two non-adjacent range tombstones with same
clustering key prefix at their bounds.
tests: Add test covering overlapped range tombstones.
"
This series addresses issue #3516 and enhances space watchdog to make it
device-aware. It's needed because since last MV-related changes, space
watchdog can be responsible for multiple hints manager, which means
multiple directories, which may mean multiple devices.
Hence, having a single static space size limit is not enough anymore
and watchdog should take it into account that different managers
may work on different disks, while yet another managers can share
the same device.
Tests: unit (release)
"
* 'enhance_space_watchdog_4' of https://github.com/psarna/scylla:
hints: reserve more space for dedicated storage
hints: add is_mountpoint function
hints: make space_watchdog device-aware
hints: add device_id to manager
hints: add get_device_id function
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But, if hints directory is mounted on a dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether storage is 'dedicated' or not is based on a simple check
if given hints directory is a mount point.
Fixes#3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.
References #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
In order to make space_watchdog device-aware, device_id field
is added to hints manager. It's an equivalent of stat.st_dev
and it identifies the disk that contains manager's root directory.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
In order to distinguish which directories reside on which devices,
get_device_id function is added to resource manager.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
To porting setup scripts to python3, following utility functions/classes
introduced:
- run(): execute command line, returns return code
- out(): execute command line, returns stdout as string
- is_debian_variant() / is_redhat_variant() / is_gentoo_variant()
/ is_ec2() / is_systemd(): detect specific environment
- hex2list(): implement hex2list.py code as a function
- makedirs(): same as os.makedirs() but do nothing when dir is exists
- dist_name() / dist_ver(): alias of platform.dist()
- class systemd_unit: an utility to control systemd unit using systemctl
- class sysconfig_parser: reader/writer of /etc/sysconfig files
- class concolor: ANSI color escape sequences list
Very often people use the issue tracker to just ask questions. We have
been telling them to close the bug and move the discussion somewhere
else but it would be better if people were already directed to the right
place before they even get it wrong.
This would be easier to everybody.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180621135051.3254-1-glauber@scylladb.com>
This comes in handy when we want to test overlapping range tombstones
because memtable would otherwise de-overlap them internally.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Tests three cases:
- a row lying inside a range tombstone
- a row that has the same clustering key as range tombstone start
- a row that has the same clustering key as range tombstone end
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
These are two RTs where one's RT end clustering is the same as another
one's RT start bound but they are both exclusive.
In this case those bounds should not (and cannot) be merged into a
single RT boundary when writing RT markers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For SSTables 3.x. ('mc' format), range tombstones are represented by
their bounds that are written to the data file as so-called RT markers.
For adjacent range tombstones, an RT marker can be of a 'boundary' type
which means it closes the previous range tombstone and opens the new
one.
Internally, sstable_writer_m relies on range_tombstone_stream to both
de-overlap incoming range tombstones and order them so that when they
are drained they can be easily thought of as just pairs of their bounds.
By default Scylla docker runs without the security features.
This patch adds support for the user to supply different params values for the
authenticator and authorizer classes and allowing to setup a secure Scylla in
Docker.
For example if you want to run a secure Scylla with password and authorization:
docker run --name some-scylla -d scylladb/scylla --authenticator
PasswordAuthenticator --authorizer CassandraAuthorizer
Update the Docker documentation with the new command line options.
Signed-off-by: Noam Hasson <noam@scylladb.com>
Message-Id: <20180620122340.30394-1-noam@scylladb.com>
On current .bash_profile it prints "Constructing RAID volume..." when
scylla_ami_setup is still running, even it running on unsupported
instance types.
To avoid that we need to run instance type check at first, then we can
run rest of the script.
Fixes#2739
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613111539.30517-1-syuu@scylladb.com>
"
Make sure we properly handle row marker and row tombstone
when reading a row.
Tests: unit {release}
"
* 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev:
sstable: consume row marker in data_consume_rows_context_m
sstable: Add consumer_m::consume_row_marker_and_tombstone
sstable: add is_set and to_row_marker to liveness_info
* https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6:
cql3::query_options: add get_names() method
tracing::trace_state: hide the internals of params_values
tracing: store queries statements for BATCH
tracing: store the prepared statements parameters values
"
A few fixes in scripts that were found when debugging #3508.
This series fixed this issue.
"
Fixes#3508
* 'ami_scripts_fixes-v1' of https://github.com/vladzcloudius/scylla:
scylla_io_setup: properly define the disk_properties YAML hierarchy
scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
scylla_io_setup: print the io_properties.yaml file name and not its handle info
scylla_lib.sh: tolerate perftune.py errors
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum it will keep
growing without bounds.
It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"
* 'tame-controller' of github.com:glommer/scylla:
controller: do not increase shares of controllers for inputs higher than the maximum
controller: adjust constants for compaction controller
"
This mini series fixes some querier-cache related issues discovered
while working on stateful range-scans.
1) A problem in the memory based cache eviction test that is is yet
unexposed (#3529).
2) Possible usage of invalidated iterators in querier_cache (#3424).
3) lookup() possibly returning a querier with the wrong read range
(#3530).
Tests: unit(release)
"
* 'fix-querier-cache-invalid-iterators-master' of https://github.com/denesb/scylla:
querier: find_querier(): return end() when no querier matches the range
querier_cache: restructure entries storage
tests/querier_cache: fix memory based eviction test
batch_statement::verify_batch_size() verifies that the total size of
mutations generated by the batch statement is smaller than certain
configurable thresholds. This is done by a custom mutation_partition
visitor, which violates atomic_cell_view::value() preconditions by
calling it even for dead cells.
The simples solution is to use
mutation_partition::external_memory_usage() instead.
Message-Id: <20180619131405.12601-1-pdziepak@scylladb.com>
When dropping a table, wait for the column family to quiesce so that
no pending writes compete with the truncate operation, possibly
allowing data to be left on disk.
Fixes#2562
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180618193134.31971-1-duarte@scylladb.com>
Patch f39891a999 fixed 3443,
but also introduced a regression in dtest - new column
was unconditionally added to view during ALTER TABLE ADD,
while it should only be the case for "include all columns" views.
This patch fixes the regression (spotted by query_new_column_test).
References #3443
Message-Id: <7410d965255a514d78cf0ce941a3236b9d8ddbbd.1529399135.git.sarna@scylladb.com>
When none of the queriers found for the lookup key match the lookup
range `_entries.end()` should be returned as the search failed. Instead
the iterator returned from the failed `std::find_if()` is returned
which, if the find failed, will be the end iterator returned by the
previous call to `_entries.equal_range()`. This is incorrect because as
long as `equal_range()`'s end iterator is not also `_entries.end()` the
search will always return an iterator to a querier regardless of whether
any of them actually matches the read range.
Fix by returning `_entries.end()` when it is detected that no queriers
match the range.
Fixes: #3530
Currently querier_cache uses a `std::unordered_map<utils::UUID, querier>`
to store cache entries and an `std::list<meta_entry>` to store meta
information about the querier entries, like insertion order, expiry
time, etc.
All cache eviction algorithms use the meta-entry list to evict entries
in reverse insertion order (LRU order). To make this possible
meta-entries keep an iterator into the entry map so that given a
meta-entry one can easily erase the querier entry. This however poses a
problem as std::unordered_map can possibly invalidate all its iterators
when new items are inserted. This is use-after-free waiting to happen.
Another disadvantages of the current solution is that it requires the
meta-entry to use a weak pointer to the querier entry so that in case
that is removed (as a result of a successful lookup) it doesn't try to
access it. This has an impact on all cache eviction algorithms as they
have to be prepared to deal with stale meta-entries. Stale meta-entries
also unnecesarily consume memory.
To solve these problems redesign how querier_cache stores entries
completely. Instead of storing the entries in an `std::unordered_map`
and storing the meta-entries in an `std::list`, store the entries in an
`std::list` and an intrusive-map (index) for lookups. This new design
has severeal advantages over the old one:
* The entries will now be in insert order, so eviction strategies can
work on the entry list itself, no need to involve additional data
structures for this.
* All data related to an entry is stored in one place, no data
duplication.
* Removing an entry automatically removes it from the index as intrusive
containers support auto unlink. This means there is no need to store
iterators for long terms, risking use-after-free when the container
invalidates it's iterators.
Additional changes:
* Modify eviction strategies so that they work with the `entry`
interface rather than the stored value directly.
Ref #3424
Do increment the key counter after inserting the first querier into the
cache. Otherwise two queriers with the same key will be inserted and
will fail the test. This problem is exposed by the changes the next
patches make to the querier-cache but will be fixed before to maintain
bisectability of the code.
Fixes: #3529
This is to stay compliant with the Origin for SSTables 3.x.
It differs from SSTables 2.x (ka/la) as for those the last promoted
index block is pushed first and the end-of-partition byte is written
after.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Right now there is no limit to how much the shares of the controllers
can grow. That is not a big problem from the memtable flush controller,
since it has a natural maximum in the dirty limit.
But the compaction controller, the way it's written today, can grow
forever and end up with a very large value for shares. We'll cap that at
adjust() time by not allowing shares to grow indefinitely.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the controller adjusts its shares based on how big the backlog
is in comparison to shard memory. We have seen in some tests that if the
dataset becomes too big, this may cause compactions to dominate.
While we may change the input altogether in future versions, I'd like to
propose a quick change for the time being: move the high point from 10x
memory size to 30x memory size. This will cause compactions to increase
in shares more slowly.
While this is as magic as the 10 before, they will allow us to err in
the side of caution, with compactions not becoming aggressive enough to
overly disrupt workloads.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This patchset runs the protocol servers under the "statement" scheduling
group, and makes all execution_stages in that path scheduling aware.
I used inheriting_concrete_execution_stage instead of passing the
scheduling group to concrete_execution_stage's constructor for two
reasons:
1. For cql statements, there is no easily accessible object that
can host the concrete_execution_stage and be reached from both
main.cc and the statements,
2. In the future, we will want to assign users to different
scheduling_groups, thus providing performance isolation for
service-level agreements (SLAs). Using an inheriting
execution_stage allows us to make the scheduling_group decision
in one place.
Depends on two unmerged patches in seastar, one fixing
inheriting_concrete_execution_stage compilation with reference parameters,
and one making smp::submit_to() scheduling aware.
"
* tag 'cql-sched/v1' of https://github.com/avikivity/scylla:
cql: make modification_statement execution_stage scheduling aware
cql: make batch_statement execution_stage scheduling aware
cql: make select_statement execution_stage scheduling aware
transport: make native protocol request processing execution_stage scheduling aware
main: start client protocol servers under the statement scheduling group
* seastar e7275e4...6422ece (7):
> build: enable concepts whenever they are supported by compiler
> shared_ptr: Enable releasing ownership of the object stored in lw_shared_ptr
> reactor: change way of calculating task quota violations
> Merge "Add metrics for steal time and task quota violations" from Glauber
> bitops.hh/log2ceil(): add special case for n == 1
> circular_buffer: add clear()
> build: add core/execution_stage.{cc,hh} to core_files
"
Tests: unit (release)
Before merging the LCS controller, we merged patches that would
guarantee that LCS would move towards zero backlog - otherwise the
backlog could get too high.
We didn't do the same for STCS, our first controlled strategy. So we may
end up with a situation where there are many SSTables inducing a large
backlog, but they are not yet meeting the minimum criteria for
compaction. The backlog, then, never goes down.
This patch changes the SSTable selection criteria so that if there is
nothing to do, we'll keep pushing towards reaching a state of zero
backlog. Very similar to what we did for LCS.
"
* 'stcs-min-threshold-v4' of github.com:glommer/scylla:
STCS: bypass min_threshold unless configure to enforce strictly
compaction_strategy: allow the user to tell us if min_threshold has to be strict
If we fail to produce a SizeTiered compaction with the configured
min_threshold, we can try again to compact any two - unless there is a
global bypass telling us no to.
This will still privilege doing larger compactions in size buckets where
that is possible, but if we are idle will try to compact any two
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.
But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.
This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
Implement and test support for reading counters in SSTables 3.
"
* 'haaawk/sstables3/read-counters-v2' of ssh://github.com/scylladb/seastar-dev:
sstable_3_x_test: add test for counters
data_consume_rows_context_m: support reading counters
Add consumer_m::consume_counter_column
Extract make_counter_cell
row.hh & mp_row_consumer.hh: Add required includes
Use serialization_header::adjust in read_statistics
sstables 3: add serialization_header::adjust
data_consume_rows_context_m: add is_column_counter
data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
column_translation: add is_counter
Currently the SSTable test is failing (at least for me and Raphael),
complaining about the file it tries to write already existing. We have
helpers now to generate temporary directories, so we should use it.
The test passes after that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180614210036.16662-1-glauber@scylladb.com>
In SSTables 3, min timestamp and min deletion time in serialization
header are not stored normally but instead the difference between
their value and the cassandra "epoch" is stored.
This is supposed to make SSTables smaller. As a consequence, we have
to add the "epoch" after reading the values to obtain the actual
values of min timestamp and min deletion time.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Memtable entries should be cleaned using memtable cleaner, which
unlike the cache' cleaner is not associated with the cache
tracker. It's an error to clean a snapshot using tracker which doesn't
own the entries. This will corrupt cache tracker's row counter.
Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.
Introduce in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. But, it's likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.
Tests: unit (release)
Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
"
This series adds the following datetime functions to CQL:
- currentTimestamp
- currentDate
- currentTime
- currentTimeUUID
- timeUUIDToDate
- timestampToDate
- timeUUIDToTimestamp
- dateToTimestamp
- timeUUIDToUnixTimestamp
- timestampToUnixTimestamp
- dateToUnixTimestamp
It also comes with datetime conversions test added to cql_query_test.
Note: issue #2949 also mentioned queries like:
$ SELECT * FROM myTable WHERE date >= currentDate() - 2d;
but it's a broader topic of supporting arithmetic operations in general,
so it's moved to #3499.
Tests: unit (release)
"
* 'support_datetime_functions_3' of https://github.com/psarna/scylla:
tests: add datetime conversions to cql_query_tests
cql3: add time conversion functions
cql3: add current* time functions
types: add time_native_type
CentOS 7.4 does support to use ambient capabilities on systemd unit
file, but on some other RHEL7 compatible enviroment doesn't, it causes
Scylla startup failure.
To avoid the issue, move AmbientCapabilities line to
/etc/systemd/system/scylla.server.service.d/, install .conf only when
both systemd and kernel supported the feature.
Fixes#3486
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613232327.7839-1-syuu@scylladb.com>
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.
This degrades scan performance of Leveled Compaction Strategy tables.
Fixes#3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>
"
After issue 3501 it turned out that IDL generates incorrect
serialization code for fragmented buffers. This series addresses
the problem by:
* providing serialization code for FragmentRange
* changing IDL generation rules for fragmented buffers, so they
expect a lower layer to iterate over fragments
* adding a test to cql_query_test suite that covers #3501
* adding a test to idl_tests suite that covers fragmented serialization
"
* 'fix_fragmented_serialization_3' of https://github.com/psarna/scylla:
tests: add fragmented serialization test to idl_tests
tests: add long text value test
idl: remove for_each from fragmented serialization
serializer: add FragmentRange serialization
Previously fragmented buffers of bytes were serialized
with a for_each loop. Since serializing bytes involves writing
size first and then data, only first fragment (and its size)
would be taken into account.
This commit changes fragmented code generation so it expects
that serialized range has a serialize(output, T) specification
and expects it to iterate over fragments on its own (just like
serializer for basic_value_view does).
Fixes#3501
Serialization for FragmentRange classes is added to serialization
suite. It first serializes total length to a 32bit field and then
writes each fragment to output.
References #3501
disk_properties map should be an entry in the 'disk' list hierarchy.
Currently this list is going to containe a single element.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
In order to get a file name from the given file() handle one should use
a file_handle.name property.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
When we check the currently configured tuning mode perftune.py is allowed
to return an error. get_tune_mode() has to be able to tolerate them.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Store the prepared statement positional parameters values in the
corresponding system_traces.sessions entry in the 'parameters' column
(which has a map<text,text> type).
Parameters are stored as a pair of "param[X]" : "value", where X is
the index of the parameter starting from 0 and the "value" is the first
64 characters of the parameter's value string representation.
If parameters were given with their names attached (see the description
on bit 0x40 of QUERY flags in the CQL binary protocol specification) then
parameters are going to be stored in the "param[X](<bound variable name>)" : "value"
form.
If the value's string representation is longer than 64 characters then the "value" will
contain only first 64 characters of it and will have the "..." at
the end.
For a BATCH of prepared statements the parameter "name" will have a form of
param[Y][X] where Y is the index of the corresponding prepared statement
in the BATCH and X is the index of the parameter. Both X and Y start from
0.
Note:
Had to switch to boost::range::find() in sstables::big_sstable_set in order to
address the "ambiguous overload" compilation error.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Similarly to the regular QUERY of EXECUTE we want to see the actual
queries statement that were part of the BATCH.
If a traced query has only a single statement to execute then its statement will be stored in a form 'query':'<statement>'.
If there are two or more queries (BATCH) then statements of each query in the BATCH will be stored in a form 'query[X]':'<statement>', where X is the index of the query in the
BATCH starting from 0.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hide it inside the trace_state.cc in order to avoid future circular
dependencies with other .hh files.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
May components limit its internal memory pools/caches/queues depending
on amount of memory present in a system. Each of them uses seastar
memory interface to get the information about memory availability
which makes it harder to 1: test the components with various memory
configurations and 2: to see which components reserve memory and how
much each one reserves.
The patch changes all the components that rely on memory size to get this
information through configuration parameter during creation instead of
checking it directly with seastar, so only main interacts with seastar
allocator.
"
* 'gleb/memory-config-v2' of github.com:scylladb/seastar-dev:
Provide available memory size to compaction_manager object during creation
Configure authorized_prepared_statment_cache memory limit during object creation
Configure logalloc memory size during initialization
Provide cql max request limit to cql server object during creation
Configure query result memory limiter size limit during object creation
Configure querier_cache size limit during object creation
Provide available memory size to messaging_service object during creation
Provide available memory size to hinted handoff resource manager during creation
Provide available memory size to storage_proxy object during creation
Provide available memory size to commitlog during creation
Provide available memory size to database object during creation
Configure prepared_statements_cache memory limit from outside
The test uses random mutations. We saw it failing with bad_alloc from time to time.
Reduce concurrency to reduce memory footprint.
Message-Id: <20180611090304.16681-1-tgrabiec@scylladb.com>
It compares only timestamps, but it should use intrinsic ordering of
the tombstone, which takes deletio ntime into consideration as well.
If we have two range tombstones with the same timestamp but different
deletion time (odd case, but still), then the one with the higher
deletion time should win. That's what all other parts of the system
use to resolve merges, in particular range_tombstone_list and
compact_mutation_state (the fragment stream compactor).
Not respecting this ordering violates the following equality:
do_compact(do_compact(m1) + m2) == do_compact(m1 + m2)
which may results in some clustered rows being missing in the
right-hand side, but not in the left-hand side, due to differences in
range tombstones.
This impacts only tests currently.
Message-Id: <1528705602-7218-1-git-send-email-tgrabiec@scylladb.com>
When I came across db/legacy_schema_migrator.cc, I had no idea what it
does and though I had obvious guesses (it somehow migrates old schemas,
right?) I didn't know what it really does. So after I figured this out,
I wrote this comment so the next person doesn't need to guess.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180605120225.25173-1-nyh@scylladb.com>
* 'systemd-coredump-debian9' of https://github.com/syuu1228/scylla:
dist/debian: fix pystache package name on Debian / Ubuntu
dist/debian: switch to systemd-coredump on Debian 9
dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
It is faster than gossiper::is_normal because it avoids to do search in
the std::map<application_state, versioned_value>. It is useful for the
code in the fast path which needs to query if a node is in NORMAL
status.
Fixes#3500
Message-Id: <42db91fa4108f9f4fcf94fed3ec403ccf35d15e9.1528354644.git.asias@scylladb.com>
"
Implement and test support for reading collections in SSTables 3.
Tests: unit {release}
"
* 'haaawk/sstables3/read-collections-v1' of ssh://github.com/scylladb/seastar-dev:
sstables 3: Add tests for reading collections
flat_mutation_reader_assertions: add more flexible asserts
data_consume_rows_context_m: add support for collections
mp_row_consumer_m: Add support for collections
data_consume_rows_context_m: introduce cell_path
Use column_translation::*_is_collection in reading
column_translation: add *_column_is_collection()
column_flags_m: add HAS_COMPLEX_DELETION
Use read_unsigned_vint_length_bytes for COLUMN_VALUE
Use read_unsigned_vint_length_bytes for CK_BLOCKS
Implement read_unsigned_vint_length_bytes
"
This is series is for nodetool getsstables.
This patch is based on:
8daaf9833a
With some minor adjustments because of the code change in sstables.
The idea is to allow searching for all the sstables that contains a
given key.
After this patch if there is a table t1 in keyspace k1 and it has a key
called aa.
curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"
Will return the list of sstables file names that contains that key.
"
* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
Add the API implementation to get_sstables_by_key
api: column_family.json make the get_sstables_for_key doc clearer
column_family: Add the get_sstables_by_partition_key method
sstable test: add has_partition_key test
sstable: Add has_partition_key method
keys_test: add a test for nodetool_style string
keys: Add from_nodetool_style_string factory method
The get_sstables_by_partition_key method used by the API to return a set of
sstables names that holds a given partition key.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a test to the has_partition_key method, it creates an
sstable with a partition key and then used that key in the
has_partition_key method to verify that it is there.
It creates a different key and use that to verify that a non exist key
return false.
This reverts part of commit 364c2551c8. I mistakenly
changed the scylla-ami submodule in addition to applying the patch. The revert
keeps the intended part of the patch and undoes the scylla-ami change.
In 4b1034b (storage_service: Remove the stream_hints), we removed the
only user of the api with the column_families parameter.
std::vector column_families = { db::system_keyspace::HINTS };
streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint),
column_families);
We can simplify the code range_streamer a bit by removing it.
Fixes#3476
Tests: dtest update_cluster_layout_tests.py
Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>
It is useful for the client driver to know which shard is serving a
particular connection, so it can only send requests through that connection
which will be served by the same shard, eliminating a hop.
Support that by advertising a "SCYLLA_SHARD" option, with a value
corresponding to the shard number.
Acked-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180606203437.1198-1-avi@scylladb.com>
* seastar 12cffef...e7275e4 (9):
> tests: execution_stage_test: capture sg by value
> Merge "Add in-path parameter suport to the code generation" from Amnon
> Merge "Add scheduling_group inheritance to execution_stage" from Avi
> tutorial: explain how to find origin of exception
> tls: Ensure handshake always drains output before return/throw
> build: cmake: correct stdc++fs library name once more
> perftune.py: make sure config file existing before write
> Update travis-ci integration
> build: fix compilation issues on cmake. missing stdc++-fs
"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.
Fixes#3483
"
* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
tests/virtual_reader_test: Add test for built indexes virtual reader
db/system_keysace: Add virtual reader for IndexInfo table
db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
index/secondary_index_manager: Expose index_table_name()
db/legacy_schema_migrator: Don't migrate indexes
If reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If also iteartors are constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.
See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an examplary scenario.
Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
There is not reason to use an std::set for it since we don't care about
the ordering - only about the existance of a particular entry.
Hash table will be more efficient for this use case.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528220892-5784-2-git-send-email-vladz@scylladb.com>
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is a first clustering column and contains token value,
computed on updates.
This series also updates tests and comments refering to issue 3423.
Tests: unit (release, debug)
"
* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
cql3: update token order comments
index, tests: add token column to secondary index schema
view: add handling of a token column for secondary indexes
view: add is_index method
ec2_snitch::gossiper_starting() calls for the base class (default) method
that sets _gossip_started to TRUE and thereby prevents to following
reconnectable_snitch_helper registration.
Fixes#3454
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.
However, most of the time there will be one stream plan at a time, the
next stream plan won't start until the previous one is finished. So, the
current coalescing does not really work.
The delayed flush adds 2s of dealy for each stream session. If we have lots
of table to stream, we will waste a lot of time.
We stream a keyspace in around 10 stream plans, i.e., 10% of ranges a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 27 hours.
To stream a keyspace with 4 tables, each table has 1000 rows.
Before:
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds
After:
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds
Fixes#3436
Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
Additional token column is now present in every view schema
that backs a secondary index. This column is always a first part
of the clustering key, so it forces token order on queries.
Column's name is ideally idx_token, but can be postfixed
with a number to ensure its uniqueness.
It also updates tests to make them acknowledge the new token order.
Fixes#3423
In order to ensure token order on secondary index queries,
first clustering column for each view that backs a secondary index
is going to store a token computed from base's partition keys.
After this commit, if there exists a column that is not present
in base schema, it will be filled with computed token.
After 70c72773be it's possible that
open_version() is called with a phase which is smaller than the phase
of the latest version, because latest version belongs to the
in-progress cache update. In such case we must return the existing
non-latest snapshot and not create a new version on top of the
in-progress update. Not doing this violates several invariants, and
may lead to inconsistencies, including violation of write atomicity or
temporary loss of writes.
partition_entry::read() was already adjusted by the aforementioned
commit. Do a similar adjustement for open_version().
Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>
We mistakenly only added network-online.target is doens't promises to
wait /var/lib/scylla mount.
To do this we need local-fs.target.
Fixes#3441
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521083349.8970-1-syuu@scylladb.com>
"
It turns out that compression just works for SSTables 3.x.
Thanks to the previous work done on the write path.
This series cleans up tests a bit and introduces test for compression
on the read path.
"
* 'haaawk/sstables3/read-compression-v1' of ssh://github.com/scylladb/seastar-dev:
Add test for compression in sstables 3.x
Extract test_partition_key_with_values_of_different_types_read
sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
Drop UNCOMPRESSD_ when code will be used for compressed too
"
This patch adds nr_shards, msb_ignore, and the actual sharding algorithm to the
system.local table. Drivers and other tools can then make use of this
information to talk to scylla in an optimal way
"
* 'system_tables-v3' of github.com:glommer/scylla:
system_keyspace: add sharding information to local table
partitioner: export the name of the algorithm used to do intra-node sharding
We would like the clients to be able to route work directly to the right
shards. To do that, they need to know the sharding algorithm and its
parameters.
The algorithm can be copied into the client, but the parameters need to
be exported somewhere. Let's use the local table for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
v2: force msb to zero on non-murmur
We will export this on system tables. To avoid hard-coding it in the system
table level, keep it at least in the dht layer where it belongs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, build_deb.sh looks very complicated because each of distribution
requires different parameter, and we are applying them by sed command one-by-one.
This patch will replace them by Mustache, it's simple and easy syntax
template language.
Both .rpm distributions and .deb distributions have pystache (a Python
implimentation of Mustache), we will use it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180604104026.22765-1-syuu@scylladb.com>
"
This series introduces a separate hinted handoff manager for materialized views.
Steps:
* decouple resource limits from hinted handoff, so multiple instances can share space
and throughput limits in order to avoid internal fragmentation for every instance's
reservations
* add a subdirectory to data/, responsible for storing materialized view hints
* decouple registering global metrics from hinted handoff constructor, now that there
can be more than one instance - otherwise 'registering metrics twice' errors are going to occur
* add a hints_for_views_manager to storage proxy and route failed view updates to use it
instead of the original hints_manager
* restore previous semantics for enabling/disabling hinted handoff - regular hinted handoff
can be disabled or enabled just for specific datacenters without influencing materialized
views flow
"
* 'separate_hh_for_mv_4' of https://github.com/psarna/scylla:
storage_proxy: restore optional hinted handoff
storage_proxy: add hints manager for views
hints: decouple hints manager metrics from constructor
db, config: add view_pending_updates directory
hints: move space_watchdog to resource manager
hints: move send limiter to resource manager
hints: move constants to resource_manager
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.
Fixes#3483
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the same comment that exists in Apache Cassandra,
explaining that the table_name column in the IndexInfo system table
actually refers to the keyspace name. Don't be fooled.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Expose secondary_index::index_table_name() so knowledge on how to
built an index name can remain centralized.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Because authorized_prepared_statements_cache caches the information that comes from
the permissions cache and from the prepared statements cache it should has the entries
expiration period set to the minimum of expiration periods of these caches.
The same goes to the entry refresh period but since prepared statements cache does have a
refresh period authorized_prepared_statements_cache's entries refresh period
is simply equal to the one of the permissions cache.
Fixes#3473
Tests: dtest{release} auth_test.py
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1527789716-6206-1-git-send-email-vladz@scylladb.com>
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
Hints for materialized view updates need to be kept somewhere,
because their dedicated hints manager has to have a root directory.
view_pending_updates directory resides in /data and is used
for that purpose.
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
"
In preparation, we change LCS so that it tries harder to push data
to the last level, where the backlog is supposed to be zero.
The backlog is defined as:
backlog_of_stcs_in_l0 + Sum(L in level) sizeof(L) * (max_level - L) * fan_out
where:
* the fan_out is the amount of SSTables we usually compact with the
next level (usually 10).
* max_levels is the number of levels currently populated
* sizeof(L) is the total amount of data in a particular level.
Tests: unit (release)
"
* 'lcs-backlog-v2' of github.com:glommer/scylla:
LCS: implement backlog tracker for compaction controller
LCS: don't construct property in the body of constructor
LCS: try harder to move SSTables to highest levels.
leveled manifest: turn 10 into a constant
backlog: add level to write progress monitor
This is the last missing tracker among the major strategies. After
this, only DTCS is left.
To calculate the backlog, we will define the point of zero-backlog
as having all data in the last level. The backlog is then:
Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out,
where:
* the fan_out is the amount of SSTables we usually compact with the
next level (usually 10).
* max_levels is the number of levels currently populated
* sizeof(L) is the total amount of data in a particular level.
Care is taken for the backlog not to jump when a new level has been just
recently created.
Aside from that, SSTables that accumulate in L0 can be subject to STCS.
We will then add a STCS backlog in those SSTables to represent that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now we are constructing the _max_sstable_size_in_mb property in
the body of the constructor, which it makes it hard for us to use from
other properties.
We are doing that because we'd like to test for bounds of that value. So
a cleaner way is to have a helper function for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Our current implementation of LCS can end up with situations in which
just a bit of data is in the highest levels, with the majority in the
lowest levels. That happens because we will only promote things to
highest levels if the amount of data in the current level is higher than
the maximum.
This is a pre-existing problem in itself, but became even clearer when
we started trying to define what is the backlog for LCS.
We have discussed ways to fix this it by redefining the criteria on when
to move data to the next levels. That would require us to change the way
things are today considerably, allowing parallel compactions, etc. There
is significant risk that we'll increase write amplication and we would
need to carefully validate that.
For now I will propose a simpler change, that essentially solves the
"inverted pyramid" problem of current LCS without major disruption:
keep selecting compaction candidates with the same criteria that we do
today, we should help make sure we are not compacting high levels for no
reason; but if there is nothing to do, use the idle time to push data to
higher levels. As an added benefit, old data that is in the higher level
can also be compacted away faster.
With this patch we see that in an idle, post-load system all data is
eventually pushed to the last level. Systems under constant writes keep
behaving the same way they did before.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We increase levels in powers of 10 but that is a parameter
of the algorithm. At least make it into a constant so that we can
reuse it somewhere else.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
SSTables 3.x format ('m') stores the size of previous row or RT marker
inside each row/marker. That potentially allows to traverse rows/markers
in reverse order.
The previous code calculating those sizes appeared to produce invalid
values for all rows except the first one. The problem with detecting
this bug was that neither Cassandra itself nor the sstabledump tool use
those values, they are simply rejected on reading.
From UnfilteredSerializer.deserializeRowBody() method,
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L562
:
if (header.isForSSTable())
{
in.readUnsignedVInt(); // Skip row size
in.readUnsignedVInt(); // previous unfiltered size
}
So while the previous test files were technically correct in that they
contained valid data readable by Cassandra/sstabledump, they didn't
follow the format specification.
This patchset fixes the code to produce correct values and replaces
incorrect data files with correct ones. The newly generated data files
have been validated to be identical to files generated with Cassandra
using same data and timestamps as unit tests.
Tests: Unit {release}
"
* 'projects/sstables-30/fix-prev-row_size/v1' of https://github.com/argenet/scylla:
tests: Fix test files to use correct previous row sizes.
sstables: Fix calculation of previous row size for SSTables 3.x
sstables: Factor out code building promoted index blocks into separate helpers.
"
This patchset contains two fixes to the clustering key prefixes
serialization logic for SSTables 3.x.
First, it fixes a vexing typo: a bitwise-and (&) has been used instead
of a remainder operator (%) for truncating the shift value.
This did not show up in existing tests because they all had non-empty
clustering columns values.
Added tests to cover empty clustering columns values.
Second, it fixes the logic of serialization to write values up to the
prefix length, not the length of the clustering key as defined by
schema. This matches the way it is done by the Origin.
There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
"
* 'projects/sstables-30/fix-clustering-blocks/v1' of https://github.com/argenet/scylla:
tests: Add test covering compact table with non-full clustering key.
sstables: Improve clustering blocks writing, use logical clustering prefix size.
tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
tests: Add unit test covering empty values in clustering key.
sstables: Fix typo in clustering blocks write helper.
"
Add handling for missing columns and tests for it.
There are 3 cases:
1. Number of columns in a table is smaller than 64
2. Number of columns in a table is greater than 64
2a. and less than half of all possible columns are present in sstable
2b. and at least half of all possible columns are present in sstable
Case 1 is implemented using bit mask and column is present if mask & (1 << <column number>) == 0
Case 2 is implemented by storing list of column numbers for each present column
case 3 is implemented by storing list of column numbers for each absent column
"
* 'haaawk/sstables3/read-missing-columns-v3' of ssh://github.com/scylladb/seastar-dev:
sstables 3: add test for reading big dense subset of columns
sstables 3: support reading big dense subsets of columns
sstables 3: add test for reading big sparse subset of columns
sstables 3: support reading big sparse subsets of columns
sstables 3: add test for reading small subset of columns
sstables 3: support reading small subsets of columns
Debug mode view_schema_test sometimes complains that a bool member
doesn't contain in-range values, apparenty in the move constructor.
Initialize them for its benefit to avoid false-positive test
failures.
Message-Id: <20180602184934.31258-1-avi@scylladb.com>
untyped_result_set_row's cell data type is bytes_opt, and the
get_block() accessor accesses the value assuming it's engaged
(relying on the caller to call has()).
has_unsalted_hash() calls get_blob() without calling has() beforehand,
potentially triggering undefined behavior.
Fix by using get_or() instead, which also simplifies the caller.
I observed failures in Jenkins in this area. It's hard to be sure
this is the root cause, since the failures triggered an internal
consistency assertion in asan rather than an asan report. However,
the error is hard to reproduce and the fix makes sense even if it
doesn't prevent the error.
See #3480 for the asan error.
Fixes#3480 (hopefully).
Message-Id: <20180602181919.29204-1-avi@scylladb.com>
Small subset is contains no more than 63 elements.
Support for large subsets will come in the following
patches.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the Origin, the size of the clustering key prefix used during
serialization is the actual length of the prefix and not the full size
as defined in schema. So the code is fixed to align with that logic.
This, in particular, is needed to write clustering blocks for RT
markers.
There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
What supposed to be an operation of taking remainder turned to be a
bitwise 'and'. This didn't show up in existing tests only because they
all had non-empty clustering values.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
This is the first part of the first step of switching Scylla. It covers
converting cells to the new serialisation format. The actual structure
of the cells doesn't differ much from the original one with a notable
exception of the fact that large values are now fragmented and
linearisation needs to be explicit. Counters and collections still
partially rely on their old, custom serialisation code and their
handling is not optimial (although not significantly worse than it used
to be).
The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming with the hopes of making it much easier to
modify the serialisation format that it would be in case of open-coded
serialisation functions.
Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors hence a different name). This makes it (relatively)
to ensure that there is an upper bound on the size of all allocations.
For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.
The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to used the new IMR-based cell implementation.
The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".
Refs #2031.
Refs #2409.
perf_simple_query -c4, medians of 30 results:
./perf_base ./perf_imr diff
read 308790.08 309775.35 0.3%
write 402127.32 417729.18 3.9%
The same with 1 byte values:
./perf_base1 ./perf_imr1 diff
read 314107.26 314648.96 0.2%
write 463801.40 433255.96 -6.6%
The memory footprint is reduced, but that is partially due to removal of
small buffer optimisation (whether it will be restored depends on the
exact mesurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.
Memory footprint:
Before:
mutation footprint:
- in cache: 1264
- in memtable: 986
After:
mutation footprint:
- in cache: 1104
- in memtable: 866
Tests: unit (release, debug)
"
* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
tests/mutation: add test for changing column type
atomic_cell: switch to new IMR-based cell reperesentation
atomic_cell: explicitly state when atomic_cell is a collection member
treewide: require type for creating collection_mutation_view
treewide: require type for comparing cells
atomic_cell: introduce fragmented buffer value interface
treewide: require type to compute cell memory usage
treewide: require type to copy atomic_cell
treewide: require type info for copying atomic_cell_or_collection
treewide: require type for creating atomic_cell
atomic_cell: require column_definition for creating atomic_cell views
tests: test imr representation of cells
types: provide information for IMR
data: introduce cell
data: introduce type_info
imr/utils: add imr object holder
imr: introduce concepts
imr: add helper for allocating objects
imr: allow creating lsa migrators for IMR objects
imr: introduce placeholders
...
Scylla now expose the prometheus API by default. This patch chagnes
scyllatop to use the Prometheus API, the collect API is still available.
The main changes in the patch:
* Move collectd specific logic inside collectd.
* Add support for help information.
* Add command line to configure prometheus end point and to enable
collectd.
* Add a prometheus class that collect information from prometheus.
Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.
Fixes#3471
Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
With the introduction of the new in-memory representation changing
column type has become a more complex operation since it needs to handle
switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes the information
when an atomic_cell is created as a member of a collection so that later
we can avoid fragmenting the value in such cases.
As a prepratation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored, they can be either a part of another IMR
object (once the rows are converted to the IMR) or a separate objects
(just like current atomic_cell).
A view schema's view_info contains the id of the base regular column
that view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.
Fixes#3443
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
IMR objects may own memory. object_allocator takes care of allocating
memory for all owned objects during the serialisation of their owner.
In practice a writer of the parent object would accept a helper object
created by object_allocator. That helper object would be either
responsible for computing the size of buffers that have to be allocated
or perform the actual serialisation in the same two phase manner as it
is done for the parent IMR object.
In some cases the actual value of an IMR object is not know at the
serialisation time. If the type is fixed-size we can use a placeholder
to defer writing it to a more conveninent moment.
This patch introduces destructors and movers for IMR objects which
enables them to own memory. Custom destructors and methods can be
defined by specialising appropriate classes.
This patch adds new way of serialising bytes and sstring objects in the
IDL. Using write_fragmented_<field-name>() the caller can pass a range
of fragments that would be serialised without linearising the buffer.
Since sstabledump and Cassandra do not use row size values, the new
files have been validated to be identical to files generated by
Cassandra with the same data inserted at same timestamps.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The previous code incorrectly calculated sizes of previous rows while
writing SSTables in 3.x ('m') format.
The problem with detecting this issue was that neither sstabledump nor
Cassandra 3.x itself use those values, as of today, they are simply
ignored when data is read from files.
Still, we want to be compatible and write correct values as they may be
of use in the future.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
tests/view_complex_test.cc contained a #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.
So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
"
Add handling for static rows and tests for it.
"
* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
sstable_3_x_test: Add test_uncompressed_compound_static_row_read
sstable_3_x_test: add test_uncompressed_static_row_read
flat_mutation_reader_assertions: improve static row assertions
data_consume_rows_context_m: Implement support for static rows
mp_row_consumer_m: Implement support for static rows
mp_row_consumer_m: Extract fill_cells
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:
(1) dropping partition entries from cache or memtables does not defer
(2) dropping partition versions abandoned by detached snapshots does not defer
(3) merging of partition versions when snapshots go away does not defer
(4) cache update from memtable processes partition entries without deferring (#2578)
(5) partition entries are upgraded to new schema atomically
This series fixes problems (1), (2) and (4), but not (3) and (5).
(1) and (2) are fixed by introducing mutation_cleaner objects which are
containers for garbage partition versions which are delaying actual freeing.
Freeing happens from memory reclaimers and is incremental.
(3) and (5) are not solved yet.
(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of partition. In order to preserve update
atomicity on partition level as perceived by reads, when update starts we
create a snapshot to the current version of partition and process memtable
entry by inserting data into a separate partition version. This way if upgrade
defers in the middle of partition reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until whole partition is upgraded. When partition
is finally merged, the snapshots go away and the new version will eventually
be merged to the old version. Due to (3) however, this merging may still add
latency to the upgrade path.
Remaining work:
- Solving problem (3). I think the approach to take here would be to
move the task of merging versions to the background, maybe into mutation_cleaner.
- Merging range tombstones incrementally.
Performance
===========
Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.
For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of remainig latency is
problem (3) stated above. The run time is reduced by 70%.
For small partition case without clustering columns we see no degradation.
For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.
For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.
Below you can see full statistics for cache update run time:
=== Small partitions, no overwrites:
Before:
avg = 433.965155
stdev = 35.958024
min = 340.093201
max = 468.564514
After:
avg = 436.929447 (+1%)
stdev = 37.130237
min = 349.410339
max = 489.953400
=== Small partition with a few rows:
Before:
avg = 315.379316
stdev = 30.059120
min = 240.340561
max = 342.408295
After:
avg = 407.232691 (+30%)
stdev = 53.918717
min = 269.514648
max = 444.846649
=== Large partition, lots of small rows:
Before:
avg = 412.870689
stdev = 227.411317
min = 286.990631
max = 1263.417847
After:
avg = 124.351705 (-70%)
stdev = 4.705762
min = 110.063255
max = 129.643387
=== Large partition, lots of range tombstones:
Before:
avg = 601.172644
stdev = 121.376866
min = 223.502136
max = 874.111572
After:
avg = 695.627588 (+15%)
stdev = 135.057004
min = 337.173950
max = 784.838745
"
* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
mvcc: Use small_vector<> in partition_snapshot_row_cursor
utils: Extract small_vector.hh
mvcc: Erase rows gradually in apply_to_incomplete()
mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
cache: real_dirty_memory_accounter: Move unpinning out of the hot path
mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
mutation_partition: Reduce row lookups in apply_monotonically()
cache: Release dirty memory with row granularity
cache: Defer during partition merging
mvcc: partition_snapshot_row_cursor: Introduce consume_row()
mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
mvcc: Make apply_to_incomplete() work with attached versions
cache: Propagate phase to apply_to_incomplete()
cache: Prepare for incremental apply_to_incomplete()
Introduce a coroutine wrapper
tests: mvcc: Encapsulate memory management details
tests: cache: Take into account that update() may defer
cache: real_dirty_memory_accounter: Allow construction without memtable
cache: Extract real_dirty_memory_accounter
mvcc: Destroy memtable partition versions gently
memtable: Destroy partitions incrementally from clear_gently()
mvcc: Remove rows from tracker gently
cache: Destroy partition versions incrementally
Introduce mutation_cleaner
mvcc: Introduce partition_version_list
mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
database: Add API for incremental clearing of partition entries
cache: Define trivial methods inline
tests: Improve perf_row_cache_update
mutation_reader: Make empty mutation source advertize no partitions
Leverage the fact that it is called with monotonically increasing
positions, and avoid lookups in case the current target entry is the
successor of desired position. Reduces cache update latency by 40%
for large partition in a time-series workload.
This change speeds up merging of partition versions with many rows in
case the merged version has many rows which fall between existing rows
in the target version. This is often the case for time-series
workloads, which insert rows at the front. Lookup can be avoided for
all but the first row in the stride because we already have a
reference to the successor in the target tree, we only need to check
that the current entry in the target tree is still the successor.
This change greatly reduces amount of lookups per row during version
merging of large partitions in time-series workloads.
Incremental merging will be implemented by the means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
Represents a deferring operation which defers cooperatively with the caller.
The operation is started and resumed by calling run(), which returns
with stop_iteration::no whenever the operation defers and is not
completed yet. When the operation is finally complete, run() returns
with stop_iteration::yes.
This allows the caller to:
1) execute some post-defer and pre-resume actions atomically
2) have control over when the operation is resumed and in which context,
in particular the caller can cancel the operation at deferring points.
It will be used to implement deferring partition_version::apply_to_incomplete().
Curently tests have a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself becasue this requires region to be unlocked at
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky add would add a lot of noise to tests. This patch will
make things easier by abstracting LSA management, among other things,
inside mvcc_conatiner and mvcc_partition classes.
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.
Large deletions could happen when large partition gets invalidated,
upgraded to a new schema, or when it's abandaned by a detached snapshot.
Refs #3289.
Partitions can get very large. Destroying them all at once can stall
the reactor for significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:
stop_iteration clear_gently() noexcept;
It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:
return repeat([this] { return clear_gently(); });
The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
"
This series provides reasoning and clarification for the current
structure of mutate_MV(), and how we handle some scenarios related to
range movements.
"
* 'materialized-views/clarifications/v3' of github.com:duarten/scylla:
db/view: Remove ifdef'd Java code
db/view: Ignore scenario where base replica hasn't joined the ring
db/view: Handle case when base has no paired view replica
"
Add handling for clustering columns and tests for it.
"
* 'haaawk/sstables3/read-ck-v3' of ssh://github.com/scylladb/seastar-dev:
Add test_uncompressed_compound_ck_read for SSTables 3.x
Add test_uncompressed_simple_read for SSTables 3.x
Implement reading clustering key from SSTables 3.x
column_translation: cache fixed value lengths for ck
data_consume_rows_context_m: use cached fixed column value lenghts
column_translation: store fix lengths of column values
consume_row_start: change type of clustering key
Rename ROW_BODY state to CLUSTERING_ROW
We don't need to parse the type every time.
It's better to cache fix lengths of column values
for sstable.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Clustering key in 3.x format is stored differently
so it's easier to create a vector of temporary buffers
instead of a single block of concatenated bytes.
Each temporary buffer stores a value of a single
clustering column.
This is because the way clustering key is stored on disk
in SSTables 3.x is not the same as the way we store it
internally.
This means that we have to first read a value of every
clustering column into temporary_buffer and only then
we can create clustering key using a vector of those
temporary buffers.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Based on:
8daaf9833a
This patch adds a from_nodetool_style_string factory method to partition_key.
The string format is follows the nodetool format, that column in the
partition keys are split by ':'.
For example, if a partition key has two column col1 and col2, to get the
partition key that has col1 = val1 and col2 = val2:
val1:val2
execute_internal() duplicates several code paths, especially in
the select path, for no good reason. It boils down to timeout and
consistency level selection which can be done based on
client_state::is_internal().
This patchset eliminated the duplication and execute_internal(),
simplifying the code.
* github.com:avikivity/scylla cql-no-execute_internal/v2:
cql: schema_altering_statement: make execute() and execute_internal()
equivalent
cql: select_statement: make execute() and execute_internal()
equivalent
cql: query_processor: don't call cql_statement::execute_internal() any
more
cql: cql_statement: remove execute_internal()
As each test completes, report it. This prevents a long-running
test in the beginning of the list from stalling output.
Message-Id: <20180526173517.23078-1-avi@scylladb.com>
"
This series introduces frozen_mutation_fragment which can be used to
send mutation_fragments over the wire to a remote node. The main
intended user is going to be the new streaming implementation.
The first part of the series fixes some IDL issues related to empty
structures and variant being the first member of a structure. Both these
problems make the generated code fail to build and they do not, in any
way, affect the existing on-wire protocol.
Logic responsible for freezing and unfreezing of mutation_fragments is
heavily based on the existing code for freezing mutations and shares the
same drawbacks (for example, unnecessary copy during unfreezing). These
preexisting performance problems can be fixed incrementally.
Another performance problem (which affects frozen_mutations as well, but
to a lesser extent) is that since the batching is done at a different
layer each frozen mutation fragment is a separate bytes_ostream object
owning at least one memory buffer. If the mutation fragments are small
this will cause an excessive number of allocations. This could be solved
either by freezing fragments in batches (though it goes against the RPC
layer doing its own batching) or using bytes_ostream or an equivalent
object with a buffer allocation policy more suitable for such use cases.
This also is something that probably could be an incremental fix.
Tests: unit (release)
"
* tag 'frozen_mutation_fragment/v1-rebased' of https://github.com/pdziepak/scylla:
idl: add idl description of frozen_mutation_fragments
tests: add test for frozen_mutation_fragments
frozen_mutation: introduce frozen_mutation_fragment
tests/idl: test variant being the first member of a structure
idl: create variant state in root node
tests/idl: test serialising and deserialising empty structures
idl-compiler: avoid unused variable in empty struct deserialisers
tests/mutation_reader: disambiguate freeze() overload
Apache Cassandra handles a case where the node hasn't joined the ring
and may consequentially have an outdated view of it. Following the same
reasoning as with the previous patch, we ignore this scenario. It
happens when there are range movements, and this node is bootstrapping,
but there are already other mechanisms in the cluster, such as hinted
handoff and dual-writing to replicas during range movements, that
contribute to this update eventually making its way to the view.
This patch doesn't change any behavior, but it provides the reasoning
why we won't use the batchlog as Cassandra does, or the hinted handoff
log as we will, to later send the update when the node is joined (note
that Cassandra just sends the mutations "later", and doesn't check
again for any condition or change).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If no view replica is paired with the current base replica, it means
there's a range movement going on (decommission or move), such that
this base replica is gaining new token ranges. The current node is
thus a pending_endpoint from the POV of the coordinator that sent the
request.
Sending view updates to the view replica this base will eventually be
paired with only makes a difference when the base update didn't make
it to the node which is currently being decommissioned or moved-from.
The update will, however, make it to that node if HH is enabled at the
coordinator, before the range movement finishes, or later to this node
when it becomes a natural endpoint for the token.
We still ensure we send to any pending view endpoints though, at least
until we handle that case more optimally.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
All cql_statement::execute_internal() overrides now either throw or
call execute(). Since we shouldn't be calling the throwing overrides
internally, we can safely call execute() instead. This allows us to
get rid of execute_internal().
execute_internal(), for some code paths, differs from execute by the
following:
1. it uses CL_ONE unconditionally
2. it has no query timeout
3. it doesn't use execution stages
for other code paths, it just calls execute.
As preparation for getting rid of execute_internal(), unify the two
code paths.
Commit 4859b759b9 caused the consistency level and timeouts
to be provided by the caller, so using the caller provided parameters
instead of overriding them does not change behavior.
"
This patchset makes all users of query_processor specify their timeouts
explicitly, in preparation for the removal of
cql_statement::execute_internal() (whose main function was to override
timeouts).
"
* tag 'cql-explicit-timeouts/v1' of https://github.com/avikivity/scylla:
query_processor: require clients to specify timeout configuration
query_processor: un-default consistency level in make_internal_options
"
Firstly, this patchset removes the is_fixed_length() function of
abstract_type in favour of value_length_if_fixed().
Secondly, it fixed the byte_type to be compatible with Cassandra which
erroneously treats it as a variable-length data type.
Lastly, it adds a unit test covering all non-composite CQL data types
for writing.
Tests: unit {release}
"
* 'projects/sstables-30/different-data-types/v1' of https://github.com/argenet/scylla:
tests: Add a unit test for writing different data types to SSTables 3.x format.
types: Treat byte_type as a variable-length type for compatibility reasons.
types: Remove is_value_fixed() and use value_length_if_fixed() instead.
Although values of the byte_type that corresponds to CQL TINYINT type
always occupy only a single byte, Cassandra treats this it as a
variable-length type for SSTables 3.0 reading and writing.
While it is clearly a mistake at Cassandra side, we have to stay
compatible.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This patch introduces IDL definition as well as serialisers and
deserialisers for freezing mutation_fragment so that they can be
transferred between nodes in a cluster.
Each non-final IDL object is preceeded by a frame containing its size.
In case of boost::variant there is a frame for the variant itself, an
integer determining the active alternative of the variant and a frame of
that active alternative.
However, if a variant was the first member of a writable stub object the
IDL would generate code that would not write the frame for the variant.
This is not a very severe issue since there are no such cases right now
as C++ type system would no allow such generated code to compile.
Deserialisers generated by IDL compiler first create a substream
covering the deserialised structure and then skip and read appropriate
members. If there are no members the substream will be unused and prompt
the compiler to emit a warning.
"
This patch series fixes#3405: secondary-index search only provided
correct results in certain cases, where entire partitions or contiguous
partition slices matched the query. When this was not the case, and
individual clustering rows match or do not match the query, the wrong
results were returned.
To fix this bug, we need to fix the two stages of secondary-index search:
1. In the first stage, we read from the index MV a list of row keys
(i.e., primary keys) matching the query. We can no longer remember
just the partition keys, and need to keep the list of full primary keys.
2. In the second stage, we have a list of rows (not partitions) and need
to read their selected contents to return to the user. Since CQL queries
do not have a syntax to select an arbitrary list of rows, we have to
add new code to do such a selection.
Because we provide an ad-hoc, inefficient, implementation for the row
selection described in stage 2, these patches leave two paths in the code:
The old path, efficiently selecting entire partitions, and the new path,
selecting individual rows. The old path is still used when it is applicable,
which is when a partition key column or the first clustering key column
is searched.
"
* 'si-fix-v4' of http://github.com/nyh/scylla:
secondary index: test multiple clustering column
secondary index: fix wrong results returned in certain cases
secondary index: method for fetching list of rows from base table
secondary index: method for fetching list of rows from index
select_statement.cc: refactor find_index_partition_ranges()
select_statement.cc: fix variable lifetime errors
This patch adds a test for secondary indexes on a table which has many
columns - two partition key column, two clustering key columns, and two
regular columns. We add a bunch of data in various rows and partitions,
index all columns and search on this data and verify the results.
This test exposed various bugs in secondary index search, including
issue #3405. After we fixed those bugs, the test now passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The current secondary-index search code, in
indexed_table_select_statement::do_execute(), begins by fetching a list
of partitions, and then the content of these partitions from the base
table. However, in some cases, when the table has clustering columns and
not searching on the first one of them, doing this work in partition
granularity is wrong, and yields wrong results as demonstrated in
issue #3405.
So in this patch, we recognize the cases where we need to work in
clustering row granularity, and in those cases use the new functions
introduced in the previous patches - find_index_clustering_rows() and
the execute() variant taking a list of primary-keys of rows.
Fixes#3405.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We add a new variant of select_statement::execute() which allows selecting
an arbitrary list of clustering rows. The existing execute() variant can't
do that - it can only take a list of *partitions*, and read the same
clustering rows from all of them.
The new select variant is not needed for regular CQL queries (which do
not have a syntax allowing reading a list of rows with arbitrary primary
keys), but we will need it for secondary index search, for solving
issue #3405.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have a method find_index_partition_ranges(), to fetch a list
of partition keys from the secondary index. However, as we shall see in
the following patches (and see also issue #3405), getting a list of entire
partitions is not always enough - the secondary index actually holds a list
of primary keys, which includes clustering keys, and in some queries we
can't just ignore them.
So this patch provides a new method find_index_clustering_rows(), to
query the secondary index and get a list of matching clustering keys.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function find_index_partition_ranges() is used in secondary index
searches for fetching a list of matching partition. In a following patch,
we want to add a similar function for getting a list of *rows*. To avoid
duplicate code, in this patch we split parts of find_index_partition_ranges()
into two new functions:
1. get_index_schema() returns a pointer to the index view's schema.
2. read_posting_list() reads from this view the posting list (i.e., list
of keys) for the current searched value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
do_with() provides code a *reference* to an object which will be kept
alive. It is a mistake to make a copy of this object or of parts of it,
because then the lifetime of this copy will have to be maintained as well.
In particular, it is a mistake to do do_with(..., [] (auto x) { ... }) -
note how "auto x" appears instead of the correct "auto& x". This causes
the object to be copied, and its lifetime not maintained.
This patch fixes several cases where this rule was broken in
select_statement.cc. I could not reproduce actual crashes caused by
these mistakes, but in theory they could have happened.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
This patchset implements reading row columns from SSTable 3 format data file.
Tests: units (release)
"
* 'haaawk/sstables3/read-columns-v4' of ssh://github.com/scylladb/seastar-dev: (21 commits)
Add test for reading column values of different types.
Support all fixed size column types from SSTable 3.x
Add abstract_type::value_length_if_fixed
Add test for simple table with value
flat_reader_assertions: Add produces_row taking column values
Implement reading rows and columns in data_consume_rows_context_m
Introduce column_flags_m
Add column_translation to data_consume_rows_context_m
Pass schema to data_consume_context
Add column_translation.hh
consumer_m: Add consume methods for consuming rows and columns
Extract make_atomic_cell from mp_row_consumer_k_l
Rename NON_STATIC_ROW_* states to ROW_BODY_*
Add liveness_info and use it in reading sstables
Add helper methods for parsing simple types.
Add unfiltered_flags_m::has_all_columns
data_consume_context: use make_unique instead of new
Pass serialization_header to data_consume_rows_context*
Use disk_string_vint_size for bytes_array_vint_size
Introduce disk_string_vint_size type
...
We need to specify --configfile on pdebuild too, otherwise we will
always fail to build .deb on newly created build environment.
Only reason why we still able to build .deb is we already copied
.pbuilderrc to home directory on existing build environment.
Fixes#3456
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523204112.24669-1-syuu@scylladb.com>
It will be needed to obtain column_translation that will
be added to data_consume_context in the next patch.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
New name describes the states in a better way as those states
will be used both for static and non-static rows.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This series introduces a cache of already authenticated prepared statements which
is meant to optimize the prepared statement lookup when authentication is enabled.
This cache allows to perform a single cache lookup per EXECUTE operation as opposed
to at least 2 lookups: one in the prepared statements cache and one in the authentication
cache.
Tests:
- cql_query_test {debug, release}.
- cassandra-stress with authentication enabled and with short eviction timeout.
- Manual (with printouts) checks:
- Tested the eviction due to eviction in the prepared_statements_cache:
- Artificially decreased the prepared_statements_cache size and ran c-s with different keyspaces.
- Verified that the corresponding authorized_prepared_statements_cache entry is evicted and re-populated.
- Tested the BATCH of prepared statements (with dtest infrastructure):
- Verified that for each prepared statement authorized_prepared_statements_cache is updated only once:
- The batch contained a few entries of the same prepared statement.
"
* 'authorized_prepared_statements_cache-v3' of https://github.com/vladzcloudius/scylla:
cql3: use authorized_prepared_statements_cache in the BATCH processing
cql3::statements::batch_statement: introduce a single_statement class
cql3: introduce the authorized_prepared_statements_cache class
loading_shared_values: introduce the templated find() overload
tests: loading_cache_test: add a tests for a loading_cache::remove(key)/remove(iterator)
utils::loading_cache: add remove(key)/remove(iterator) methods
cql3::query_processor: properly stop() prepared_statements_cache object
Since Seastar no longer (1f005fb434) requires libunwind, we can
drop it from our dependency list. This helps the power build, for
which no libunwind is available.
Fixes#3453.
Message-Id: <20180523114750.10753-1-avi@scylladb.com>
"
This series implements the backlog tracker for TWCS, allowing it to
be controlled. The backlog for a TWCS colum family is just the sum of
the SizeTiered backlogs for all the windows that we know about.
A possible optimization for this is to stop tracking windows after
they become old enough and revert to zero backlog. I reverted that
last minute, though, since this will probably cause the backlog to
completely misrepresent reality if we import SSTables into old buckets
with things like repairs or nodetool refresh.
"
* 'twcs-backlog-v4.1' of github.com:glommer/scylla:
backlog: implement backlog tracker for the TWCS
STCS_backlog: allow users to query for the total bytes managed
backlog: keep track of maximum timestamp in write monitor
memtable: also keep track of max timestamp
The TWCS backlog is relatively simple: we just need to keep track of
which SSTable belong to which time window (and actually as usual,
just their sizes). That is an easy thing to do since we can statically
calculate the time bound from the timestamp.
Once we do that we can just sum the backlogs for each individual window.
Time windows that are well enough into the past can be at some point
discarded when their backlogs become zero.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The exploded_clustering_prefix type has a convenient is_empty() method
and an even more convenient "operator bool" shortcut. Unfortunately,
the other clustering prefix types (clustering_key_prefix,
clustering_key_prefix_view) have, for historic reasons, an is_empty
method which takes a schema parameter. That also means they can't
have an "operator bool" shortcut.
But checking if a prefix doesn't really need the schema - all we need to
check is whether the byte representation is empty. The result is simpler
and more efficient code, and easier to use. It is also more consistent -
all clustering-key-related types will have an "operator bool" instead of
just some of them.
To avoid massive code changes, we leave a is_empty(schema) variant, which
simply calls is_empty(). There's already precedent for that - various
methods which have a variant taking schema (and ignoring it) and one
taking nothing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180521174220.13262-1-nyh@scylladb.com>
Like with the EXECUTE command avoid authorizing the same prepared
statement twice - this time in the context of processing the BATCH
command.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This is a helper class needed to control the handling process of a single
statement in the current batch. In particular it has the boolean defining
if the authorization is needed for this statement.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Add a cache that would store the checked weak pointer to already authorized prepared statements
and which key is a tuple of an authenticated_user and key of the prepared_statements_cache.
The entries will be held as long as the corresponding prepared statement is valid (cached)
and will be discarded with the period equal to the refresh period of the permissions cache.
Entries are also going to be discarded after 60 minutes if not used.
The purpose of this new cache is to save the lookup in the permissions cache for already authenticated
resource (whatever is needed to be authenticated for the particular prepared statement).
This is meant to improve the cache coherency as well (since we are going to look in a single cache
instead of two).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This overload alows searching the elements by an arbitrary key as long as it is "hashable"
to the same values as the default key and if there is a comparator for
this new key.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
remove(key): removes the entry with the given key if exists, otherwise does nothing.
remote(iterator): removes an entry by a given iterator (returned from loading_cache::find()).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This commit makes sure that hints manager is always initialized,
including creating hints directories and starting it.
It needs to be fixed because hints manager is internally used
to store failed materialized view replicas.
Fixes#3451
Message-Id: <44532fd3704e20cabeb9c4985dace5650fd22d2c.1527018865.git.sarna@scylladb.com>
"
This series addresses issue #3202 about dropping a table with secondary
indexes present. Previously dropping such tables was impossible due to
materialized view restrictions (which is an implementation detail
of Scylla's secondary indexes).
Implemented:
* fixing 'DROP KEYSPACE' with active materialized views
* adapting schema_builder to make it easy to drop indexes
* dropping all dependent SI before dropping a table
* a test case for dropping a table with secondary indexes
"
* 'drop_si_before_drop_table_3' of https://github.com/psarna/scylla:
tests: add test for dropping a table with secondary indexes
migration_manager: allow dropping table with secondary indexes
schema: add clearing indexes to schema builder
database: do not truncate already removed views
This commit adds a test case for dropping a table with dependent
secondary indexes. Dependent materialized views prohibit the table
from being dropped, but dropping a table with dependent SI is legal.
References #3202
Previously dropping a table with secondary indexes failed, because
SI are internally backed by materialized views.
This commit triggers dropping dependent secondary indexes before
dropping a table.
Fixes#3202
This commit clears table's views before truncating it
in drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in no_such_column_family error.
References #3202
"
This series introduces materialized view statistics, as stated in issue #3385:
- updates pushed
- updates failed
- row lock stats
It also addresses issue #3416 by decoupling user write stats from view
update stats.
"
* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
view: adapt view_stats to act as write stats
storage_proxy: decouple write_stats from stats
db: add row locking metrics
view: add view metrics
We would like to know whether there is still backlog at rest in a
particular STCS object. This is useful, for instance, in the TWCS
backlog, that uses STCS so it can delete old windows that are no longer
used.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
For sealed SSTables we can get the maximum timestamp from the statistics
component. But for partially written SSTables, the metadata is not yet
available.
One way to solve this would be to make the SSTable statistics available
earlier. But we would end up with a maximum timestamp that potentially
changes all the time as we write more cells.
A better approach is to take note of what's the maximum timestamp in a
memtable before we start to flush, and when time comes for us to flush
we will use the progress manager to inform the consumers about the
maximum timestamp.
For SSTables being compacted, we can't know for sure what is the maximum
timestamp as some entries could be TTLd already. But the maximum of all
SSTables present in the compaction is a good enough estimation for this
purposes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are now keeping track of the minimum timestamp in a memtable. Also
keep track of the max timestamp so we can know what it is before we
finish flushing the entire memtable to an SSTable. Will be used by
partially written SSTables undergoing TWCS.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This was sent before as two separate patchsets. It is now unified
because it has a lot of common infrastructure.
In this patchset I am aiming at two goals:
1) Provide a minimum amount of shares for user-initiated operations like
nodetool compact and nodetool cleanup
2) Be more robust with exceptions in the backlog tracker
For the first, the main difference is that I now made the compaction
controller a part of the compaction manager. It then becomes easy to
consult with the compaction controller for the correct amount of shares
those operations should have.
In compaction_strategy.cc, the major_compaction_strategy object was
actually already unused before. So instead of making use of it, which
would require some form of information flow downwards about the backlog
we need to export, I am creating a user-initiated backlog type inside
the compaction manager.
With the two changes described above everything is very well
self-contained within the compaction manager and the implementation
becomes trivial.
For the second, I am now handling exceptions in two places:
1) the backlog computation. Those are const functions so if we just have
a transient exception when compacting the backlog, all we need to do is
return some fixed amount of shares and try again in the next adjustment
window.
2) the process of adding / removing SSTables. Those are harder, since if
we fail to manipulate the list we'll be left in an inconsistent state.
The best approach is then to disable the backlog tracker and return a
fixed amount of shares globally.
Tests: unit (release)
"
* 'backlog-improvements-v3' of github.com:glommer/scylla:
compaction_manager: disable backlog tracker if we see an exception
backlog tracker: protect against exceptions in backlog calculation.
STCS_backlog: protect against negative backlog
STCS_backlog: remove unused attribute
compaction strategy: move size tiered backlog to a header
compaction_strategy: delete major_compaction_strategy class
compaction: make sure that user-initiated compactions always have a minimum priority
backlog_controller: add constants to represent a globally disabled controller
backlog_controller: move compaction controller to the compaction manager
backlog_controller: allow users to compute inverse function of shares
This commit adapts view_stats structure so it can be passed
to storage_proxy as write stats. Thanks to that, mv replica updates
will not interfere with user write metrics. As a side effect it also
provides more stats to replica view updates.
Closes#3385Closes#3416
This commit extracts metrics related to writes from stats structure,
so it can be easily replaced later, e.g. for materialized view metrics.
References #3385
References #3416
This commit adds statistics to row_locker class. Metrics are
independendly counted for all lock types: row<->partition and
exclusive<->shared.
Metrics gathered:
- total acquisitions
- operations that wait on the lock
- histogram of the time spent on waiting on this type of lock
References #3385
References #3416
This commit introduces view statistics:
- updates pushed to local/remote replicas
- updates failed to be pushed to local/remote replicas
Metrics are kept on per-table basis, i.e. updates_pushed_remote
shows the number of total updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.
References #3385
References #3416
Compactor collects all currently active memtables and later replaces
them with the merged result. The problem is that active memtable
belongs to the input set during compaction and as a result mutations
applied concurrently with compaction could be lost once compaction
replaces the memtables. The fix is to open a new active memtable when
compaction starts.
Caused sporadic failures of row_cache_test.cc:test_continuity_is_populated_when_read_overlaps_with_older_version()
Message-Id: <1526997724-13037-1-git-send-email-tgrabiec@scylladb.com>
If we see an exception when adding or removing SSTables from the backlog
tracker, the backlog tracker can be inconsistent forever. It would be
best if we act before that happens and disable the backlog tracker. Once
the backlog tracker is disabled it will default to returning a fixed
number of shares.
We can either disable the backlog tracker or remove it. But if we remove
it we can end up with a backlog of zero if that's the only tracker with
a backlog. We then keep it registered but mark it as disabled. This also
leaves room for recovery in some situations: we can recover the backlog
by a doing a schema change in the column family that had the backlog
disabled, for instance.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Backlog calculations should be exception free, but there are at cases in
which I can see they happening. One example is if some backlog tracker
that uses temporary objects fails an allocation.
Memory shortages can be specially pernicious: if we leave the
responsibility of catching those to the individual backlog tracker, we
will keep trying to make more allocations in the other backlog trackers
if we have many column families. By handling it here we can stop that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
A negative backlog can be interpreted as a very large backlog.
Part of that is because we keep the total_size as an unsigned type,
which is what we expect. But in case there is an issue-- like an
exception that causes some SSTable not to be tracked then this size
can become negative. Returning a zero backlog is better than allowing
it to be interpreted as a giant number.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This attribute ended up being unused in the final version.
Spotted now while reading the code for other purposes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It's very common to other strategies to include a SizeTiered
step somehow inside their algorithms: LCS will do SizeTiered on
L0, TWCS will do SizeTiered within a window, etc.
To make it easier for those strategies to consume the SizeTiered
backlog tracker, we will move that to its own file.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It was already unused before this series. In an earlier version I have
used it to provide an ad-hoc backlog for major compactions. But now that
this is done by the compaction manager, this class really isn't being
used.
And it is likely it won't be: major compaction is not a compaction
strategy a user can choose, unlike the others that need to be built
through make_compaction_strategy.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have observed the following behavior with user initiated compactions,
like major compactions:
- if there are no writes, the backlog doesn't increase.
- as compaction progresses the backlog decreases.
- at some point, the backlog is so low that compaction barely makes any
progress.
Going forward, we should allow one to read from the generated partial
SSTables, in which case this doesn't matter that much. But for
user-iniated compactions we would like to guarantee a minimum baseline.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.
We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.
So the mechanism to globally disable the controller is still granted,
and infinity is a good way to represent that. It's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it as the other way
around from what it means, just meaning "a very large backlog".
Let's turn that into a constant instead. It will help us convey meaning.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager-- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.
Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.
Once we do that, all the aforementioned problems go away. So let's move
there where it belongs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Fixes#3446
Previously, only shutdown-synced objects where actually closed,
which is wrong.
This introduces yet another queue, processed together with the
deletion objects, which ensures we explicitly close all objects
that have been discarded.
Message-Id: <20180521140456.32100-1-calle@scylladb.com>
"This series leverages hinted handoff for failed view replica
updates."
* 'materialized_view_updates_with_hh_5' of https://github.com/psarna/scylla:
storage_proxy: enable hinted handoff for materialized views
storage_proxy: make view updates use consistency_level::ANY
row::find_cell() may be called for cells that do not exist in that row.
In such case nullptr shall be returned, this patch makes sure that
it is not dereferenced.
Message-Id: <20180522091726.24396-1-pdziepak@scylladb.com>
There are some situations in which we want to force a specific amount of
shares and don't have a backlog. We can provide a function to get that
from the controller.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar a6cb005...5da5d4e (6):
> append_challenged_posix_file_impl: Ensure continuation uses non-stale object
> utils: make make_visitor() public
> tcp: Adjust receive window
> tcp: Fix allowed sending size calculation in can_send
> tcp: Fix assert in tcp::tcb::output_one
> be more descriptive with failed syscalls for filesystem operations
Contains alternative fix for #3446 (will also be fixed directly).
This commit initializes and enables hinted handoff for materialized
views, even if HH is not explicitly turned on in config.
User writes still use hinted handoff only if it is explicitly enabled,
while materialized views are allowed to use it unconditionally
in order to store failed replica updates somewhere.
Fixes#3383
This commit makes view replica updates internally use consistency
level ANY, so in case an update fails it will fall back to hinted
handoff.
References #3383
install(1) creates missing directories on recent Fedora, but not
on CentOS 7. This causes the RPM build (which installs to a pristine
tree, without an existing /etc) to fail.
Fix by setting up /etc.
Tests: rpm (Fedora, CentOS)
Message-Id: <20180520124937.20466-1-avi@scylladb.com>
There is no need to call dht::split_ranges_to_shards to split the token
range into <shard> : <a lot of small ranges> mapping and create a flat
mutation reader with a lot of small ranges.
Because:
1) The flat mutation reader on each shard only returns data belongs to
this local shard, there is no correctness issue if we do not split and
feed the sub ranges only belongs to this local shard.
2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that given a token range, all the shards will have data for the range
anyway. Even if we ask all the shards to work on the token range and
some of the shards have no data for it, it is fine. We simply send no
data from this shard.
Tests: update_cluster_layout_tests.py
Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
In streaming, the sender sends the mutations on all the local shards in
parallel, it is possible that the receiver handle more than one such
connection on the same shard. It is determined by where the tcp
connection goes. Current rpc ignores the dest shard id when sending the
rpc message.
For instance, say node1 has 2 shards, node2 has 2 shards. Currently, we
can end up with like this:
Node 1 shard 0 -> Node 2 shard 1
Node 1 shard 1 -> Node 2 shard 1
It is better if we do:
Node 1 shard 0 -> Node 2 shard 0
Node 1 shard 1 -> Node 2 shard 1
This patch solves this problem by let the handler always handle on
shard = src_cpu_id % smp::count.
If sender and receiver have the same shard config, it is completely
distributed the work evenly.
If sender and receiver do not have the same shard config, it is
unavoidable some of the shard will do more work than the others.
Tests: dtest update_cluster_layout_tests.py
Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
Fixes#2793
Prints error handle class (commitlog or "other/disk") + exception
type and message. While not exhaustive, at least gives a correlation
point to (hopefully) other log printouts.
Message-Id: <20180509081040.7676-1-calle@scylladb.com>
"
For compression, SSTables 3.x format uses CRC32 for checksumming
compressed chunks as well as for calculating the full file checksum.
Also, while for older formats "full checksum" of a compressed data file
means a combination of checksums of its compressed chunks, in SSTables
3.x this now reads literally and assumes the checkum of all bytes
written, including per-chunk digests.
Tests: unit {debug, release}
"
* 'projects/sstables-30/write-compression/v3' of https://github.com/argenet/scylla:
tests: Add unit tests for writing compressed SSTables 3.x.
tests: Validate Digest32.crc for SSTables 3.x write tests.
tests: Fix invalid Digest file for write_counter_table test.
sstables: Support writing compressed SSTables 3.0.
sstables: Make compressed streams customizable on checksumming.
sstables: Move checksum calculation logic to compressed_output_stream.
Previously, compressed_output_stream used to calculate checksum of the
supplied chunk and pass it to the 'compression' object to combine with
the full checksum calculated on prior writes.
Now, all the checksum calculation happens inside
compressed_output_stream and 'compression' only stores the result.
This is done to loosen ties between two classes and simplify
compressed_output_stream customisation with various checksum algorithms.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.
The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
* tag 'tgrabiec/fixes-and-improvements-for-gdb-scripts-v1' of github.com:tgrabiec/scylla:
gdb: Print live object size from 'scylla lsa-segment'
gdb: Extend 'scylla segment-descs' output with full occupancy info
gdb: Print allocated object's type name instead of full LSA migrator
gdb: Fix LSA migrator discovery
gdb: Drop code related to LSA zones
gdb: Fix uses of removed segment_desctriptor::_lsa_managed
lsa: Add use for debug::static_migrators
Move code to a traditional install.sh script (more traditional would be
a "make install", but this is close enough).
This allows testing installation independently of packaging. In addition,
non-Red Hat-packaging can share much of the code in install.sh.
Ref #3243.
Tests: build+install rpm
Message-Id: <20180517114147.30863-1-avi@scylladb.com>
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it cause the module load to fail, with the result that the
storage is not available.
Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.
Fixes#3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".
I measured 50% reduction of cache update run time in a steady state for an
append-only workload with large partition, in perf_row_cache_update version from:
c3f9e6ce1f/tests/perf_row_cache_update.cc
Other workloads, and other allocation sites probably also could see the
improvement.
"
* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
lsa: Expose counters for allocation and compaction throughput
lsa: Reduce amount of segment compactions
lsa: Avoid the call to segment_pool::descriptor() in compact()
lsa: Make reclamation on reserve refill more efficient
Fixes#3339
* seastar 840002c...0a1a327 (7):
> Merge "fix perftune.py issues with cpu-masks on big machines" from Vlad
> Merge 'Handle Intel's NICs in a special way' from Vlad
> reactor: fix calculation of idle ticks
> log: streamline logging internals a little
> Merge "CMake imrovements and compatibility" from Jesse
> iotune: fix typo in property name
> cmake: do not find_package(Boost ...) if Boost is a target
"
This patchset adds support for writing counter cells in SSTables 3.x
format ('m'). The logic of writing counters is almost identical to that
used for the old 2.x format ('k'/'l') with the only difference that the
data length preceding serialised shards is written as a vint.
Tests: unit {release}.
Generated SSTables are verified to be processed fine by sstabledump
(note that sstabledump only outputs the binary data for counters, not
their actual values, same as sstable2json).
Verified with Cassandra 3.11 to get the expected values from the
counters table:
cqlsh> SELECT * from sst3.counter_table;
pk | ck | rc1 | rc2
-----+-----+-----+-----
key | ck1 | 10 | 1
(1 rows)
Verified that the deleted counter can no longer be updated:
cqlsh> use sst3 ;
cqlsh:sst3> UPDATE counter_table SET rc1 = rc1 + 2 WHERE pk = 'key' AND ck = 'ck2';
cqlsh:sst3> SELECT * from sst3.counter_table;
pk | ck | rc1 | rc2
-----+-----+-----+-----
key | ck1 | 10 | 1
(1 rows)
"
* 'projects/sstables-30/write_counters/v1' of https://github.com/argenet/scylla:
tests: Unit tests to cover writing counters in SSTables 3.x format.
sstables: Support writing counters for SSTables 3.x.
sstables: Move code writing counter value into a separate helper.
sstable test fails when running concurrently (for example, release and debug
mode) because it uses a static temporary dir in lots of tests.
Let's fix it by switching to dynamic temporary dir, which is created using
mkdtemp(). Also the sstable tests will now run in /tmp, and so it's made
much faster.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180516042044.15336-1-raphaelsc@scylladb.com>
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).
This patch reduces amount of segment compactions in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, and just eviction.
In perf_row_cache_update test case for large partition with lots of
rows, which simulates appending workload, I measured that for each new
object allocated, 2 need to be migrated, before the patch. After the
patch, only 0.003 objects are migrated. This reduces run time of
cache update part by 50%.
"
The memory estimations we have when using the chunked vector
are usually slightly wrong. We can make them more accurate by
exporting the memory usage directly as a chunked_vector API.
"
* 'chunked_memory-v2' of github.com:glommer/scylla:
large_bitset: be more accurate with memory usage
chunked_vector: exports its current memory usage
We are slightly underestimating the amount of memory we use. Now that
the chunked vector can exports its internal memory usage we can use that
directly.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we would like to estimate how much memory
a chunked_vector is using. We have two strategies to do it:
1) multiply the size by the size of the elements. That is wrong, because
the chunked_vector can allocate larger chunks in anticipation of more
elements to come.
2) multiply the number of chunks by 128kB. That is also wrong, because
the chunk_vector will not always allocate the entire chunk if there are
only a few elements in it.
The best way to deal with it is to allow the chunked_vector to exports
its current memory usage.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Commit 9eb8ea8b11 installed
scylla_blocktune.py as part of preparing the rpm, but forgot
to add it to the installed file list, breaking the rpm build.
Fix by listing the file in the %files section.
Message-Id: <20180506202807.5719-1-avi@scylladb.com>
"
SSTables 3.x (format 'm') use CRC32 instead of Adler32 for calculating
checksums. This patchset introduces support for CRC32 along with Adler32
in checksummed_file_writer to be used for SSTables written in 'mc'
format.
Structures and helpers introduced for CRC32 will be later used for
calculating checksums for compressed files as well (not a part of this
patchset).
Tests: unit {release}
"
* 'projects/sstables-30/write-digest-crc/v3' of https://github.com/argenet/scylla:
tests: Add test covering checksumming SSTables 3.0 with CRC32.
sstables: Support CRC32 checksum for SSTables 3.x.
sstables: Move adler32 routines under the scope of a class.
sstables: Move checksum utils into separate header.
sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
"
The native protocol server generates mant reactor tasks that
can be easily eliminated. I measured a read workload with 100%
cache hit rate, seeing the number of tasks per request drop
from ~31 to ~27, and an increase of 3% in throughput.
"
* tag 'transport-optimize-1/v1' of https://github.com/avikivity/scylla:
transport: remove unused capture of flags variable
transport: merge response write and error handling continuations
transport: make write_repsonse() return void
transport: de-template a lambda
transport: merge memory-management and logging continuations
transport: remove gate continuation
transport: merge two response processing continuations
transport: simplify response processing continuation
transport: remove gratuitous continuation from process_request_one()
Remove implicit timeouts and replace with caller-specified timeouts.
This allows removing the ambiguity about what timeout a statement is
executed with, and allows removing cql_statement::execute_internal(),
which mostly overrode timeouts and consistency levels.
Timeout selection is now as follows:
query_processor::*_internal: infinite timeout, CL=ONE
query_processor::process(), execute(): user-specified consisistency level and timeout
All callers were adjusted to specify an infinite timeout. This can be
further adjusted later to use the "other" timeout for DCL and the
read or write timeout (as needed) for authentication in the normal
query path.
Note that infinite timeouts don't mean that the query will hang; as
soon as the failure detector decides that the node is down, RPC
responses will termiante with a failure and the query will fail.
Make the consistency level explicit in the caller in order to clarify
what is going on.
An "internal" query used to mean that it was accessing local tables,
so infinite timeouts and a consistency level of ONE were indicated,
but authentication accesses non-local tables so explicit consistency
level and timeouts are needed.
It just schedules the response, and returns immediately.
(I thought about calling it schedule_response(), but usually it will
write the response immediately, since waiting for network writes is
rare in a local network).
with_gate() generates a continuation if the protected function defers.
Avoid that by merging a gate::leave() call with another, preexisting,
continuation.
We have one coninuation transforming the result, and another shutting
down tracing. Since the first cannot defer, we can merge the two, reducing
the number of tasks processed by the reactor.
This patch fixes a bug where queries using a secondary index would, in
some cases, produce the same rows multiple times.
The problem was that the code begins by finding a list of primary keys
that match the search, and then work on the partitions containing them.
If multiple rows matched in the same partition, the partition was considered
multiple times, and the same rows were output multiple times.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180510203141.17157-1-nyh@scylladb.com>
If we're given a single reader (can be common in a low-write-rate table,
where most of the data will be in a single large sstable, or in leveled
tables) then we can avoid the overhead of the combining reader by returning
the single input.
Tests: unit (release)
Message-Id: <20180513130333.15424-1-avi@scylladb.com>
"
This patchset implements separate timeouts for range queries, and lays
the foundations for separate timeouts for other query types.
While the feature in itself is worthy, the real motivation is to have
the timeouts decided by the caller, instead of storage_proxy. This in
turn is required to disentangle each layer behaving differently
depending on whether the query is internal or not; instead, the goal
is to have each caller declare its needs in terms of consistency level
and timeouts, and have the lower layers implement its requirements
instead of making their own decisions.
Fixes#3013.
Tests: unit (release)
"
* tag '3013/v1.1' of https://github.com/avikivity/scylla:
storage_proxy: remove default_query_timeout()
storage_proxy: don't use default timeouts
query_options: augment with timeout_config
thrift: configure thrift transport and handler with a timeout_config
transport: configure native transport with a timeout_config
cql3: define and populate timeout_config_selector
timeout_config: introduce timeout configuration
Before we accept running while not in developer mode, we verify that
the I/O Scheduler is properly configured. Up until now, that meant
verifying that --max-io-requests is properly set and that the number
of I/O Queues is enough to leave at least 4 requests per I/O Queue.
Systems that move to newer versions of Scylla may continue doing that,
so we need to be backwards compatible and keep testing for that.
However, newer systems will not set that option, but pass a YAML
property file (or string) instead. So we need to make sure that
either one of those is set.
If the property file is set, I am deciding here not to test for
number of I/O queues. scylla_io_setup will usually configure that
anyway, plus we plan on soon moving to all-shards-dispatch making
that less important.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180509163737.5907-1-glauber@scylladb.com>
There is an ongoing discussion in issue 2678 about the right time to
release permits. Right now we are releasing the permit after we flush
all data for the memtable plus the SSTables accompanying components -
plus flushing them, closing them, etc.
During all that time, we are increasing virtual dirty by adding more
data to the buffers but we are not able to decrease it-- until we
release the permit we can't start flushing the next memtable. This is
much more of a concern than I/O overlapping as described in the issue.
We have a hook in the SSTable write process that is (should be) called
as soon as data is written. We should move the permit release there.
We aren't, though, calling that as early as we could. The call to the
data written hook is writing after the Index is closed, summary is
sealed, etc.
This patch fixes that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-2-glauber@scylladb.com>
Currently reserve refill allocates segments repeatedly until the
reserve threhsold is met. If single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we try to allocate. In particular, it would not attempt to
compact any segment until it evicts total amount of memory first,
which may reduce the total amount of segment compactions during
refill.
This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
We have conflict between scylla-libgcc72/scylla-libstdc++72 and
scylla-libgcc73/scylla-libstdc++73, need to replace *72 package with
scylla-2.2 metapackage to prevent it.
Fixes#3373
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180510081246.17928-1-syuu@scylladb.com>
"Previously, partition tombstone was not written for partitions with no
rows causing corrupted data files.
This is now fixed and covered with tests.
In addition, we now track partition tombstones while collecting encoding
statistics."
* 'projects/sstables-30/fix-partition-tombstone/v3' of https://github.com/argenet/scylla:
tests: Don't use deprecated schema constructor.
tests: Add tests to cover partitions consisting only of partition keys.
sstables: Make sure partition level tombstone is written for partitions with no rows.
memtable: Collect statistics from partition-level tombstone.
"
This is preparatory cleanup series with fixes/cleanup of miscellaneous
issues that I discovered while working on the stateful range-scans.
Since the stateful range-scans series, even without these patches, is a
20+ patches strong series I'd like to fast-track this, to ease reviewing
the former.
Most of the changes here are related to code-hygenie and effectiveness
and there is a patch that is correctness-related ("querier: check only
the end bound of ranges when matching them") and one that is related to
ease-of-use ("range: clean the deduced transformed type").
Note that altough these changes were made in the context of working on
the stateful range-scans they make sense on their own as well.
Tests: unit(release, debug)
"
* '1865/pre-range-scans-cleanup/v1' of https://github.com/denesb/scylla:
multishard_combining_reader: use optimized optional for the shard reader
Use dht::token_range alias for last/preferred replicas
storage_proxy::coordinator_query_result: merge constructors into one w/ default params
querier: check only the end bound of ranges when matching them
querier: take range and slice by value
querier: remove const params from make_compaction_state()
querier: make _range and _slice const
flat_multi_range_mutation_reader: optimize for non-plural range vectors
range: clean the deduced transformed type
"
Fixes#3420.
Tests: dtest (`auth_test.py`), unit (release)
"
* 'jhk/fix_3420/v2' of https://github.com/hakuch/scylla:
cql3: Include custom options in LIST ROLES
auth: Query custom options from the `authenticator`
auth: Add type alias for custom auth. options
"
These patches were extracted from much larger series that introduces new
in-memory representation of cells. They contain various enhanecments and
fixes that to a varying degree make sense on its own. Sending them
separately will hopefuly ease the review and merging proces of the whole
IMR effort.
Tests: unit(release).
"
* tag 'pre-imr/v1' of https://github.com/pdziepak/scylla:
tests/perf: add microbenchmarks for basic row operations
tests: simple_schema: add make_row_from_serialized_value()
row: add clear_hash()
types: move compare_unsigned() to bytes.hh
lsa: provide migrator with the object size
lsa: add free() that does not require object size
db/view/build_progress: avoid copying mutation fragment
mutation_partition: enable ADL for cell swap
types: make some collection_type_impl functions non-static
counters: drop revertability of apply()
mutable_view: add default constructor and const_iterator
tests/mutation_reader: do not apply mutations created on another shard
sstables: do not call atomic_cell::value() for dead cells
lsa: sanitize use of migrators
lsa: reuse registered migrator ids
lsa: make migrators table thread-local
The querier provides a `matches(const nonwrapping_range&)` member to
allow for checking whether a range matches that with which the querier
was originally created. The check for match is more lax than a strict
equality check as ranges are shrunk query progresses.
Because of this the above member only checked that one of the bounds of
the examined ranges matches. This is adequate as for this purpose
because, in the context of a single query, it is guaranteed that no
two read requests to the same replica will have overlapping range.
However Avi pointed out in a recent, related review, that this check can
be made a little more strict by requiring that the end-bounds of the
two ranges *always* matches, instead of allowing any of the bounds to
match.
Don't create a flat_multi_range_mutation_reader when the range vector
has 0 or 1 element. In the former case create an empty reader and in the
latter just create a reader with the mutation-source with the only range
in the vector.
wrapping_range and nonwrapping_range offer a transform() member function
which allows creating a new range by applying a transformer function to
the bounds of the current range. The type of bounds of the new range is
deduced from the return type for this transformer function. However the
return type is used as-is, with any CV or reference attached to it.
Since it doesn't make sense to create a range of references or a type
with CV qualifiers strip these off the deduced type.
An implementation of `authenticator` can support custom options for a
each role.
If, to make up an example, the authenticator supported the `region` key,
then a role would be created as follows:
CREATE ROLE jsmith WITH OPTIONS = { 'region': 'north_america' }
AND PASSWORD = 'super_secure';
LIST ROLES will now print this custom option map as an additional column
with the heading "options".
However, none of the implementations of `authenticator` in Scylla
currently support OPTIONS, so LIST ROLES will in practice, for now,
print the empty set:
role | super | login | options
-----------+-------+-------+---------
cassandra | True | True | {}
None of the `authenticator` implementations we have support custom
options, but we should support this operation to support the relevant
CQL statements.
simple_schema::make_row() is not very well suited for performance tests
of row and cell creation since it serialises the value. This patch
introduces a new function that performs only minimal actions.
compare_unsigned() is a general utility function that compares two
bytes_view byte-by-byte. There is no need to include whole type.hh in
order to make it available.
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the infromation to the migrator we can avoid some needless
operations.
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
Calling fully qualified std::swap() prohibits the cell objects from
using their own swap implementations. This patch invokes std::swap in
the usual ADL-friendly way.
The switch to the new in-memory representation will require a larger
parts of the logic be aware of the type of the values they are dealing
with. In most cases it is not a significant burden for the users.
Scylla uses shared-nothing architecture and communication between the
shards is supposed to be very restricted. Applying to a memtable
mutations created on another shard is way to complex operation to be
allowed. Using frozen mutations is a much safer option.
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in the debug mode
with the hopes of catching any misuse early.
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound how many
times they can be created and destroyed it is better to be safe and
reuse registered migrator ids.
"
SSTable 3.0 format introduces serialization header which is used in reading SSTables in that format.
This patchset implements loading of this new component of Statistics.db.
Tests: units (release)
"
* 'haaawk/sstables3/load_serialization_header_v2' of ssh://github.com/scylladb/seastar-dev:
Load serialization_header from statistics
Add parse for disk_array_vint_size
Add helpers to read/parse vints
Add signed_vint::serialized_size_from_first_byte
Add sstable::get_serialization_header
Move random_access_reader to separate header
250ms is too high of a period for memtable controller. Since memtable
flushes are relatively efficient, specially in comparison to
compactions, if the shares are high we can flush a lot of data down with
the high shares - so in the next adjustment period our shares will be
minuscule and we won't flush much at all.
This leads to oscillating behavior that is mitigated by adjusting
faster.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-3-glauber@scylladb.com>
"Fixes a bug in partition_snapshot::merge_partition_versions(), which would not
attempt merging if the snapshot is attached to the latest version (in which
case _version is nullptr and _entry is != nullptr). This would cause
partition_version objects to accumulate if there was an older snapshot and it
went away before the latest snapshot. Versions will be removed when the whole
entry goes away (flush or eviction).
May cause performance problems.
Fixes #3402."
* 'tgrabiec/fix-merge_partition_versions' of github.com:tgrabiec/scylla:
mvcc: Test version merging when snapshots go away
anchorless_list: Make ranges conform to SinglePassRange
anchorless_list: Drop deprecated use of std::iterator
mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
The newer version of iotune, recently merged to Seastar, accepts
a new parameter that tells us where should we store the properties
about the disk.
We are already generating that properties file for the AMI case.
Let's also pass that parameter when calling iotune.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180507175757.9144-1-glauber@scylladb.com>
Starting from 2018.1 and 2.2 there was a change in the repository path.
It was made to support multiple product (like manager and place the
enterprise in a different path).
As a result, the regular expression that look for the repository fail.
This patch change the way the path is searched, both rpm and debian
varations are combined and both options of the repository path are
unified.
See scylladb/scylla-enterprise#527
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180429151926.20431-1-amnon@scylladb.com>
"
In SSTables 3.0, the base and increment fields have been swapped in
Bloom filters to reduce collisions (see CASSANDRA-8413). This affects
the resulting values written to Filter.db.
This patchset adds support for reading/writing Filter.db in the format
corresponding to the version of SSTables.
Tests: unit {release}
Filter.db files have been generated using Cassandra 3.11 with same data
as in unit tests and are validated to match those generated by Scylla.
"
* 'projects/sstables-30/write-filter/v1-2' of https://github.com/argenet/scylla:
Fix mistakes and typos in comments (minor clean-up)
Check Filter.db in SSTables 3.x write tests.
Support Bloom filter format used in SSTables 3.0.
Remove unused overload of i_filter::get_filter().
The two hash values, base and increment, used to produce indices for
setting bits in the filter, have been swapped in SSTables 3.0.
See CASSANDRA-8413 for details.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Fixes crash in cql_tests.StorageProxyCQLTester.table_test
"avoid race condition when deleting sstable on behalf..." changed
discard_sstables behaviour to only return rp:s for sstables owned
and submitted for deletion (not all matching time stamp),
which can in some cases cause zero rp returned.
Message-Id: <20180508070003.1110-1-calle@scylladb.com>
When node is decommissioned/removed it will drain all its hints and all
remote nodes that have hints to it will drain their hints to this node.
What "drain" means? - The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move to the next one till there are
no more segments left.
After all hints are drained the corresponding hints directory is removed.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Returning a future with an exception from end_point_manager::stop()
is practically useless because the best the caller can do is to log
it and continue as if it didn't happen because it has other things
to shut down.
Therefore in order to simplify the caller we will log the exception
if it happens and will always return a non-exceptional future.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This ensures we respect the write timeout set by the client when
applying base writes, in case a writes takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
* seastar ac02df7...840002c (20):
> dpdk: protect against missing statistics
> alien: make visible in documentation
> Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
> app_template: Correct outdated comment
> apps, tests: Catch polymorphic exceptions by reference
> configure.py: Enhance detection for gcc -fvisibility=hidden bug
> reactor: add rudimentary task histogram reporting
> Revert "Merge "rewrite iotune to conform to the new ioscheduler" from Glauber"
> Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
> build: Use the same warning name for Clang and GCC
> core/rwlock: Add support for timeouts
> fs qualification: protect against EINTR
> Docker: Fix failing build due to missing GNU make
> reactor: move optional to experimental so we compile with c++14
> future: remove allocation from future::get() thread context switch
> Merge "rpc streaming" from Gleb
> reactor: put mountpoint_params in seastar namespace
> Tutorial: in PDF version of tutorial, better backtick typesetting
> tutorial: support, and start using, links to other sections
> tutorial: improve second half of semaphores section
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit to 100 the
number of outstanding view updates. We limit globally per shard, and
not per destination view replica. We also limit statically.
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
The functions in cql_assertions.hh are very convenient, but have one
frustrating drawback: When you have many of those assertions in one
test, it's very hard to know *which* of the similar assertions failed.
The problem is that an error often looks like this:
unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/cql_assertions.cc(131): last checkpoint
Which of the many similar checks in "test_many_columns" failed? Note the
unhelpful "unknown location" and also the "last checkpoint" points to code
in cql_assertions.cc, not in the actual test, so it is useless.
The root cause of these problems is that the Boost macros use the C
preprocessor __FILE__ and __LINE__, which in actual C++ functions like
is_rows() remembers its location, instead of the caller. Fixing this will
not be simple. But this patch has a much simpler solution - fixing the
"last checkpoint". What ruins the last checkpoint is the use of BOOST_REQUIRE
inside the cql_assertions.cc is_rows() - when that succeeds, it records
the location inside cql_assertions.cc (!) as the last success.
If we just replace BOOST_REQUIRE by our own test (just like in the rest of
the cql_assertions.cc code), this code will not override the last checkpoint.
The user can see the last real successful BOOST_REQUIRE, or use
BOOST_TEST_PASSPOINT() to set his own checkpoints between different parts of
the same test.
After this patch, and with adding BOOST_TEST_PASSPOINT() calls between
different parts of my test, the failure above now looks like:
unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/secondary_index_test.cc(299): last checkpoint
The "last checkpoint" now shows me exactly where my failing check was.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501152638.26238-1-nyh@scylladb.com>
"This patchset adds ostream operators to result_message and uses them
in cql_assertions."
* tag 'result_message-print/v1.1' of https://github.com/avikivity/scylla:
tests: cql_assersions: improve error message when a row is not found
transport: add ostream support to result_message
transport: const correctness for result_message::accept()
Use utf8_type where warranted.
Fixes view_schema_test failure where the rows did not match. I don't
understand exactly why the failure happened (using the wrong type
should not cause a failure here), but the change fixes the problem.
Tests: view_schema_test (release)
Message-Id: <20180506130015.7450-1-avi@scylladb.com>
The visitor does not alter the result_message it is visiting (and
its signature indicates that) so accept() should be const-qualified
to indicate that and to allow visiting const result_message:s.
"
This patchset adds support for writing Statistics.db in the SSTables
'mc' (3.x) format. This file is essential for reading data stored in
Data.db as it contains base values used for delta encoding and types of
columns.
This patchset also fixes several bugs found in writing data and index
files as well as bugs in a statistics-related structure definition.
Tests: unit {debug, release}
All SSTables files for write unit tests are validated to be processed by
sstabledump and output is verified to show the expected data.
"
* 'projects/sstables-30/write-statistics/v1' of https://github.com/argenet/scylla:
Add test covering the composite partition key case.
Add Statistics.db files to write tests for SSTables 3.0.
Do not check rows and cells for expiration when writing them to the data file.
Fix promoted index serialization.
Fix the order of items in stats_metadata.
Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
Write serialization header to Statistics.db for SSTables 3.x.
Do not pass schema to metadata_collector::update(column_stats)
Collect metadata statistics when writing SSTables 3.0.
Call get_metadata_collector() instead of referencing sstable::_collector directly.
Fix logic of writing TTLed cells in SSTable 3.0 format.
Separate statistics for count of cells, columns and rows in column_stats.
Deserialize collection in a way that doesn't incur shared_ptr counter increment and is generally shorter.
Track both min & max values for timestamp, TTL and local deletion time in metadata_collector.
Add class for tracking both extremum values (min and max) on updates.
Mainly to check that the composite type is properly serialized when
writing serialization header to Statistics.db.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For these tests to work, all time-related values are now fixed as these
are stored in Statistics.db files.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Although this logic may be seen as a useful optimization, it hinders
unit tests writing SSTables 3.0 as those need to have fixed time-related
values to produce Statistics.db files with the same content on each run.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
There is a new field introduced in the SSTables 3.0 index file format
named 'partition_header_length' that can be used to skip over to the
first clustering row in a wide partition. This one has not been
previously written and caused malformed indices.
Updated the corresponding test to include a static row and write
multiple wide partitions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Serialization header is a new components in Statistics.db introduced in
SSTables 3.0 ('ma') format. It is essential for reading data file as it
contains the base values used for delta-encoded values (timestamps,
TTLs, local deletion times) and description of column types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
After reboot, all existing sstables are considered shared. That's a safe default.
Reader used by compaction decides to use filtering reader (filters out data that
doesn't belong to this shard) if sstable is considered shared even though it may
actually be unshared.
By avoiding filtering reader we're avoiding an extra check for each key, and that
may be meaningful for compaction of tons of small partitions and even range
reads of such. We do so by fixing sstable::_shared, which is now set properly for
existing sstables at start.
quick check using microbenchmark which extends perf_sstable with compaction mode:
before: 69407.61 +- 37.03 partitions / sec (30 runs, 1 concurrent ops)
after: 70161.09 +- 40.35 partitions / sec (30 runs, 1 concurrent ops)
Fixes#3042.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504182158.21130-1-raphaelsc@scylladb.com>
"
This series introduces a system.large_partitions table,
used to gather information on largest partitions in the cluster.
Schema below allows easy extraction of most offending keys and removal
by sstable name, which happens when a table is compacted away.
Schema: (
keyspace_name text,
table_name text,
sstable_name text,
partition_size bigint,
key text,
compaction_time timestamp,
PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);
"
Closes#3292.
* 'large_partition_table_3' of https://github.com/psarna/scylla:
database, sstables, tests: add large_partition_handler
db: add large_partition_handler interface with implementations
docs: init system_keyspace entry with system.large_partitions
db: add system.large_partitions table
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
Proper large_partition_handler is retrievable from config information
and is based on existing compaction_large_partition_warning_threshold_mb
entry. Right now CQL TABLE variant of large_partition_handler is used
in the database.
Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
This commit introduces large_partition_handler class, which can be used
to take additional action when large partitions are written.
It comes with two implementations:
* NOP, used in tests, which does nothing on large partition
update/delete
* CQL TABLE, which inserts/deletes information on particular sstable
to system.large_partitions table, in order to be retrievable from
cqlsh later.
References #3292
This commit adds a system.large_partitions table, which can be used
to trace largest partitions of a cluster.
Schema: (
keyspace_name text,
table_name text,
sstable_name text,
partition_size bigint,
key text,
compaction_time timestamp,
PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);
References #3292
After removal of deletion manager, caller is now responsible for properly
submitting the deletion of a shared sstable. That's because deletion manager
was responsible for holding deletion until all owners agreed on it.
Resharding for example was changed to delete the shared sstables at the end,
but truncate wasn't changed and so race condition could happen when deleting
same sstable at more than one shard in parallel. Change the operation to only
submit a shared sstable for deletion in only one owner.
Fixes dtest migration_test.TestMigration.migrate_sstable_with_schema_change_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180503193427.24049-1-raphaelsc@scylladb.com>
A step to untie classes sstable_writer_m and sstable so that eventually
we could stop them being friends.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
SSTables 3.0 format makes a distinction between count of cells and count
of columns. In that sense, a column of a collection type counts as one
column but every atomic cell in it counts as a separate cell.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
test_multishard_combining_reader_destroyed_with_pending_create_reader
was failing because it relied on smp == 3 and thus the shard on which
the reader creation is blocked being shard-2. Since the test requires to
be run with smp >= 3 we can hardcode this shard to be 2 because if the
test runs at all we are guaranteed to have at least smp >= 3.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <38883a1f4c18ca0cd065aa13826a4f1858353289.1525328233.git.bdenes@scylladb.com>
These tests are quite complicated and require intimate knowledge of how
foreign_reader and multishard_combining_reader operates. Knowing these
two objects is still required to understand the tests but make it that
much easier by explaining how they were designed to test what they test.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8de580131a8652924de920c2bc68a98e579398ee.1525328226.git.bdenes@scylladb.com>
'shard' is a short-lived on-stack variable that gets captured by
reference by continuation that gets executed on another shard.
Fixes a race condition that leads to an heap-use-after-free.
Message-Id: <20180502150507.2776-1-pdziepak@scylladb.com>
The test_foreign_reader_destroyed_with_pending_read_ahead test currently
doesn't ensure that the objects in it's scope are destroyed in the
correct order. This is necessary as there are severeal foreign pointers
to objects that live on remote shards and use each other. Since
foreign pointers destory their managed object in the background we
cannot rely on the to reliably destroy objects in order, nor can we be
sure when the object they manage is actually destroy.
So to work around that ensure that the puppet_reader is destroyed before
the remote_control it references even has a chance of being destroyed.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <232eaa899878b03fb2a765c2916e4f05841472a3.1525269726.git.bdenes@scylladb.com>
Test for Scylla's default choice of secondary index name (we found one
small problem, see issue #3403, and left it commented out). Also test
the ability to give indices non-default names.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501153439.26619-1-nyh@scylladb.com>
Add a test that adding a secondary-index for an only partition key column
is not allowed (it would be redundant), but indexing one of several partition
key columns *is* allowed. This reproduced issue #3404, and verifies that
it was fixed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-2-nyh@scylladb.com>
Indexing an only partition key component is not allowed (because it would
be redundant), but it should be allowed to index one of several partition
key components. We had a bug in that case: the underlying materialized view
we created had the same column as both a partition key and a clustering
key, which resulted in an assertion failure. This patch fixes that.
Fixes#3404.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-1-nyh@scylladb.com>
The db/index directory contains just a few lines of code that exists
there for historical reasons. It's confusing that we have both db/index
and index/ directory related to secondary-indexing.
This patch moves what little is still in db/index/ to index/. In the
future we should probably get rid of the "secondary_index" class we had
there, but for now, let's at least not have a whole new directory for it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501101246.21143-1-nyh@scylladb.com>
Fixes a bug in partition_snapshot::merge_partition_versions(), which
would not attempt merging if the snapshot is attached to the latest
version (in which case _version is nullptr and _entry is !=
nullptr). This would cause partition_version objects to accumulate if
there was an older snapshot and it went away before the latest
snapshot. Versions will be removed when the whole entry goes away
(flush or eviction).
May have caused performance problems.
Fixes#3402.
"
Both multishard_combining_reader and foreign_reader use read-head in the
background to avoid blocking consumers. These read-aheads can be still
pending when the reader is destroyed and hence extra attention is needed
to avoid memory errors. Recent manual testing, done in the context of
testing code that is using the multishard reader, proved that these
cases were not handled correctly in the initial series introducing it
(2d126a79b).
This series introduces fixes and comprehensive tests for all problematic
scenarios:
1) multishard_combining_reader is destroyed with pending reader creation
on a remote shard.
2) foreign_reader is destroyed with pending read-ahead.
3) multishard_combining_reader is destroyed with pending read-ahead.
"
* 'multishard-reader-read-ahead-fixes/v2' of https://github.com/denesb/scylla:
test.py: add custom seastar flags for mutation_reader_test
test.py: move custom seastar flags for tests declarative
mutation_reader_test: add read-ahead related multishard reader tests
tests/mutation_reader_test: change recommented smp to 3
mutation_reader_test: fix name of existing multishard reader tests
simple_schema: add global_simple_schema
simple_schema.hh: remove unused include
multishard_combining_reader: prepare for read-ahead otliving the reader
foreign_reader: prepare for read-ahead outliving the reader
multishard_combining_reader: avoid creating the shard reader twice
multishard_combining_reader: read_ahead: don't assume reader is created
multishard_combining_reader: move read-ahead related methods
multishard_combining_reader: avoid looking up the shard reader twice
multishard_combining_reader: use optional for maybe created reader
Add tests for foreign_reader and multishard_combining_reader that check
that readers destroyed while there is pending read-head will not result
in use-after-free.
Specifically check that:
* multishard_combining_reader destroyed with pending reader creation
* foreign_reader destroyed with pending read-ahead
* multishard_combining_reader destroyed with pending read-ahead
does not result in use-after-free or SEGFAULT.
These tests try to do their best to check for correct behaviour with
various BOOST_REQUIRE* checks but they still heavily rely on ASAN to
detect any use-after-free, SEGFAULT or similar errors.
Of the test_multishard_combining_reader_reading_empty_table test.
Running this test with smp=3 instead of smp=2 helps detecting additional
read-ahead related memory problems.
Which allows a simple_schema instance to be transferred to another
shard. In fact a new simple_schema instance will be created on the
remote shard but it will use the same schema instance the the original
one.
When the multishard reader is destroyed there might be severeal pending
read-aheads running in the background. These read-aheads need their
associated reader to stay alive until after the read-ahead completes.
To solve this move the flat_mutation_reader into a struct and manage
this struct's lifetime through a shared pointer. Fibers associated with
read-aheads that might outlive the multishard reader will hold on to a
copy of the shard pointer keeping the underlying reader alive until they
complete. To avoid doing any extra work a flag is added to this state
which is set when the multishard reader is destroyed. When this flag is
set, pending continuations will return early. All this is encapsulated
in multishard_combining_reader::shard_reader the multishard reader code
itself need not be changed.
The foreign reader keeps track of ongoing read-aheads via a
foreign_ptr to the read-ahead's future on the remote shard. This pointer
is overwritten after each "remote call" to the remote reader with a
pointer to the future of the new read-ahead's future.
There are severeal problems with the current implementation:
1) There is a new read-ahead launched after each "remote call"
unconditionally, even if the remote reader is at EOS. This will start
unecessary read-ahead when the reader is already finished and may be
soon destroyed (legally) by the client.
2) The pointer to the remote read-ahead future is not set to nullptr
when a remote call is issued. Thus in the destructor, where we
attach a continuation to the read-ahead's future to extend the
reader's lifetime until after the read-ahead finishes, we migh attach
a continuation to a future that already has one and run into a failed
assert().
To fix this issues reset the read-ahead pointer to nullptr each time a
remote call is issued and don't start a new read-ahead if the remote
reader is at EOS. This way we can ensure that when the reader is
destroyed we either have a valid and non-stale read-aead future or none
at all and can reliably make a decision about whether we need to extend
the lifetime of the remote reader or not.
The multishard reader creates its shard readers on demand when they are
first attempted to be used. However at this time the reader migh already
be in the progress of being created, initiated by a previous read-ahead.
To avoid creating the shard reader twice, before creating the reader
check whether there are any read-aheads in progress. If there is, it
already created (is creating or will create) the reader and hence
synchronise with the read ahead. Synchronisation happens via a promise,
the read ahead creates a promise which will be fulfilled when the reader
is created. A concurrent create_reader() call will wait on this promise
instead of attempting to create a new reader.
Currently it is assumed that when read_ahead is called the reader is
already created. Under most circumstances this will not be true. It was
blind (bad) luck that we didn't hit this before (during testing).
To the group of methods that do not assume the reader is already
created. A patch will follow that will update read_ahead() to not assume
that the reader is created.
After a little "research" [1] it turns out my initial fears were
completely without ground, std::optional::operator->() and
std::optional::opterator*() doesn't involve an unnecessary branch and
thus there is no need to hand-roll an optional with a separate bool.
[1] http://en.cppreference.com/w/cpp/utility/optional/operator*
Determine which timeout we need to apply at prepare time. We
don't know the numerical value (since it depends on whoever is
executing the query, not just the statement type), but we know
which member of timeout_config we need, so determine and remember
that.
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:
storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)
Fixes#3393.
Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
Confirm that issue #2991 is indeed fixed - creating a secondary index
with IF NOT EXISTS ignores an already existing index, and dropping with
IF EXISTS ignores a non-existant index.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180430071714.10154-1-nyh@scylladb.com>
The existing test_secondary_index_case_sensitive only tested the
case-sensitive case of the column being indexed, and only in some
scenarios. Further testing exposed more bugs - issue #3388, issue #3391,
issue #3401. This patch adds tests which reproduced those bugs, and now
verifies their fix.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-9-nyh@scylladb.com>
test_case_sensitivity from tests/view_schema_test.cc was well-intentioned,
aiming to test from different angles the issue of non-lowercase (quoted)
column names and their interaction with materialized views.
But unfortunately, it didn't test anything! This is because the quotation
marks were forgotten, so all the identifier in this test were folded to
lowercase, and the test didn't test non-lowercase identifiers like it
intended.
So this patch adds the missing quotes, to make this test great again.
After the patches for issues #3388 and #3391 which I sent earlier, the
test *passes* (before those patches, the fixed test did not pass -
the unfixed test trivially passed).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-8-nyh@scylladb.com>
When the secondary index code builds a "%s IS NOT NULL" clause for a
CQL statement, it needs to quote the column name if it needs to be
(not only lowercase, digits and _).
Fixes#3401.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-7-nyh@scylladb.com>
We had another case-sensitivity bug in materialized views, where if
a case-sensitive (quoted) column name was listed explicitly on "SELECT"
(instead of implicitly, e.g., in "SELECT *") the column name was
incorrectly folded to lower-case and inserts would fail.
This patch fixes the code, where a "SELECT" statement was built using
the desired column names, but column names that needed quoting were
not being quoted. The bug was in a helper function build_select_statement()
which took column name strings and failed to quote them. We clean up this
function to take column definitions instead of strings - and take care
of the quoting itself. It also needs to quote the table's name in the
select statement being built.
Fixes#3391.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-6-nyh@scylladb.com>
Before this patch, if a materialized view is defined with the restriction
IS NOT NULL on a case-sensitive (quoted) column name, inserts fail with
a "restriction 'foobar IS NOT null' unknown column foobar" error, where
foobar is the lowercased version of the case-sensitive column name.
The problem is that the code uses single_column_relation::to_string()
to convert the relation into a CQL where clause. And indeed, this method
generates a CQL expression; But it calls column_identifier::raw::to_string()
to print identifiers. This is the wrong function - it doesn't quote
identifiers that need quoting because they are not lowercase.
So this patch uses column_identifier::raw::to_cql_string() (a method we
added in the previous patch) to generate the properly quoted CQL relation.
Fixes#3388
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-5-nyh@scylladb.com>
Implement a method column_identifier::raw::to_cql_string(). Exactly like
the one without "raw", this method quotes the identifier name as needed
for CQL. We'll need this method in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-4-nyh@scylladb.com>
There is no reason for to_cql_string() and maybe_quote() to both
implement the same quoting algorithm. Use the latter to implement the
former.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-3-nyh@scylladb.com>
The utility function maybe_quote() is supposed to quote identifier names
(name of keyspace, table, or column) according to CQL rules, e.g., if the
name has any uppercase or non-alphanumeric characters, it needs to be
quoted. Unfortunatelty, it didn't quite do the right thing, so this patch
fixes that. This patch also adds a comment explaining what maybe_quote()
is supposed to do (until now, users could only guess).
Fixes#3400.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-2-nyh@scylladb.com>
In commit d674b6f672, I fixed a case-
sensitive column name bug by avoiding CQL quoting of a column name
in create_index_statement.cc when building a "targets" option string.
However, there is also matching code in target_parser.hh to unquote
that option string. So this unquoting code is no longer necessary, and
should be dropped.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-1-nyh@scylladb.com>
Different request types have different timeouts (for example,
read requests have shorter timeouts than truncate requests), and
also different request sources have different timeouts (for example,
an internal local query wants infinite timeout while a user query
has a user-defined timeout).
To allow for this, define two types: timeout_config represents the
timeout configuration for a source (e.g. user), while
timeout_config_selector represents the request type, and is used
to select a timeout within a timeout configuration. The latter is
implemented as a pointer-to-member.
Also introduce an infinite timeout configuration for internal
queries.
In the current code, if the base table has a compound partition key (i.e.,
multiple partition-key columns) searching its secondary indexes didn't work.
There is no real reason why this, it was a just a bug in preparing the
second query:
Every SI query is converted to two queries. The first queries the associated
materialized view, to find a list of primary keys. Those we need to use in a
second query, of the base table. The second query needs to list, as
restrictions, the keys found above. When a partition key is compound, its
components build one key and one restriction. But in the buggy code, we
incorrectly used each component as a separate (improperly formatted) key
and restriction, and obviously this didn't work.
This patch also adds a test that reproduces this problem and confirms its fix.
In the fixed code I also found another incorrect use of to_cql_string() (which
could break case-sensitive primary key column names) and changed it to
to_string().
Fixes#3210.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429124138.24406-1-nyh@scylladb.com>
We make multiple attempts to mark a node as alive. We do that be
sending an EchoMessage, and marking the node as alive upon receiving a
successful answer. In case there's a network partition and the nodes
can't reach each other, multiple messages may be delivered and
processed.
We can avoid processing duplicate EchoMessage replies by checking
whether we had already marked the node as alive.
Fixes#1184
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180428191942.31990-1-duarte@scylladb.com>
"
Recently many changes have landed in seastar for the I/O Scheduler. We
can now describe the I/O storage of a machine by its visible properties
like throughput and bandwidth instead of relying in an indirect
calculation.
For the instances we support, we can just measure that and start using
them right away.
A version of iotune that computes those properties is not yet ready, but
in its making I have noticed that we aren't really setting the nomerges
and scheduler properties of the disks under testing. We definitely
should, since that can influence the results. So this patchset also
starts doing that.
The commandline for iotunev2 shouldn't change much. When it is ready we
will just adjust this script once more.
"
* 'scylla_io_setup' of github.com:glommer/scylla:
scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
scylla_lib: drop support for m3 and c3 AWS instance types
io_setup: call blocktune before tuning I/O
blocktune: allow it to be called as a library.
scripts: move scylla-blocktune to scripts location
* seastar 70aecca...ac02df7 (5):
> Merge "Prefix preprocessor definitions" from Jesse
> cmake: Do not enable warnings transitively
> posix: prevent unused variable warning
> build: Adjust DPDK options to fix compilation
> io_scheduler: adjust property names
DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references prefixed with SEASTAR_. Some may need to become
Scylla macros.
iterator incorrectly dereferenced when timestamp resolution not
explicitly specified.
following dtests are fixed:
compaction_additional_test.CompactionAdditionalStrategyTests_with_TimeWindowCompactionStrategy.compaction_is_started_on_boot_test
compaction_additional_test.CompactionAdditionalTest.compact_data_by_time_window_test
compaction_additional_test.CompactionAdditionalTest.compaction_removes_ttld_data_by_time_windows_test
compaction_test.TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180427192545.17440-1-raphaelsc@scylladb.com>
We can use iotunev2 (or any other I/O generator) to test for the limits
of the disks for the i2 and i3 instance classes. The values I got here
are the values I got from ~5 invocations of the (yet to be upstreamed)
iotune v2, with the IOPS numbers rounded for convenience of reading.
During the execution, I verified that the disks were saturated so we
can trust these numbers even if iotunev2 is merged in a different form.
The numbers are very consistent, unlike what we usually saw with the
first version of iotune.
Previously, we were just multiplying the concurrency number by the
number of disks. Now that we have better infrastructure, we will
manually test i3.large and i3.xlarge, since their disks are smaller
and slower.
For the other i3, and all instances in the i2 family storage scales up
by adding more disks. So we can keep multiplying the characteristics of
one known disk by the number of disks and assuming perfect scaling.
Example for i3, obtained with i3.2xlarge:
read_iops = 411k
read_bandwidth = 1.9GB/s
So for i3.16xlarge, we would have read_iops = 3.28M and 15GB/s - very
close to the numbers advertised by AWS.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
m3 has 80GB SSDs in its largest form and I doubt anybody has ever
used it with Scylla.
I am also not aware of any c3 deployments. Since it is past generation,
it doesn't even show up in the default instance selector anymore.
I propose we drop AMI support for it. In practice, what that means is
that we won't auto-tune its I/O properties and people that want to use
it will have to run scylla_io_setup - like they do today with the EBS
instances.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are not configuring the disks the way we want them with respect to
scheduler and nomerges. This is an oversigh that became clear now that
I started rewriting iotune-- since I will explicitly test for that. But
since this can affect the results, it should be here all along.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch makes the functions in scylla-blocktune available as a
library for other scripts - namely scylla_io_setup.
The filename, scylla-blocktune, is not the most convenient thing to call
from python so instead of just wrapping it in the usual test for
__main__ I am just splitting the file into two.
Another option would be to patch all callers to call
scylla_blocktune.py, but because we are usually not using extensions in
scripts that are meant to be called directly I decided for the split.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
scylla-blocktune currently lives in the top level but this is mostly
historical. When time comes for us to install it, the packaging systems
will copy it to /usr/lib/scylla with the others.
So for consistency let's make sure that it also lives in the scripts
directory.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After upgrade from 1.7 to 2.0, nodes will record a per-table schema
version which matches that on 1.7 to support the rolling upgrade. Any
later schema change (after the upgrade is done) will drop this record
from affected tables so that the per-table schema version is
recalculated. If nodes perform a schema pull (they detect schema
mismatch), then the merge will affect all tables and will wipe the
per-table schema version record from all tables, even if their schema
did not change. If then only some nodes get restarted, the restarted
nodes will load tables with the new (recalculated) per-table schema
version, while not restarted nodes will still use the 1.7 per-table
schema version. Until all nodes are restarted, writes or reads between
nodes from different groups will involve a needless exchange of schema
definition.
This will manifest in logs with repeated messages indicating schema
merge with no effect, triggered by writes:
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
The sync will be performed if the receiving shard forgets the foreign
version, which happens if it doesn't process any request referencing
it for more than 1 second.
This may impact latency of writes and reads.
The fix is to treat schema changes which drop the 1.7 per-table schema
version marker as an alter, which will switch in-memory data
structures to use the new per-table schema version immediately,
without the need for a restart.
Fixes#3394
Tests:
- dtest: schema_test.py, schema_management_test.py
- reproduced and validated the fix with run_upgrade_tests.sh from git@github.com:tgrabiec/scylla-dtest.git
- unit (release)
Message-Id: <1524764211-12868-1-git-send-email-tgrabiec@scylladb.com>
"
This patch series introduces initial support for writing SSTables in
'mc' format (aka SSTables 3.0).
Currently, the following components are written in 3.0 format:
- Data.db
- Index.db
- Summary.db
(there were no changes to summary files format compared to ka/la)
Other SSTables components are written in the old format for now as they
still need to exist to satisfy post-flush processing.
For now, only rows are written to the data file and indexed. Range
tombstones are not supported.
Writing rows is supported in full with the only exception being counter
cells. All the other features (TTLed data, row/cell level tombstones,
collections, etc) are supported.
Unit tests rely on producing files and binary-comparing them with
'golden' copies that are produced using Cassandra 3.11. This is done to
not block until reading SSTables 3.0 format is implemented.
=======================================
Implementation notes
=======================================
Internally, sstable_writer has been refactored to support multiple
implementations that are instantiated in its constructor based on the
sstable version. Little to no code is shared among sstable_writer_v2 and
sstable_writer_v3 as we only intend to support sstable_writer_v2
alongside sstable_writer_v3 for a single release (to be able to do
rollback on rolling upgrade failure) and then plan to get rid of it
entirely and switch to always writing SSTables in the new format.
The design of sstable_writer_v3 mostly follows that of its precursors
sstable_writer(_v2) and components_writer. Some refactoring and further
code rearrangements are expected in the future but the main code is
there.
"
* 'projects/sstables-30/write-rows/v2' of https://github.com/argenet/scylla:
Add tests for writing data and index files in SSTables 3.0 ('mc') format.
Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
Add missing enum values to bound_kind.
Add building blocks for writing data in SSTables 3.0 format.
Refactor sstable_writer to support various internal implementations.
Add is_fixed_length() to data types.
Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
bound_kind::clustering, bound_kind::excl_end_incl_start and
bound_kind::incl_end_excl_start are used during SSTables 3.0 writing.
bound_kind::static_clustering is not used yet but added for completeness
and parity with the Origin.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For any given CQL data type, this member returns whether its values are
of fixed or variable length. This is used by SSTables 3.0 format to only
store the length value for variable-length cells.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
This patchset prepares everything for support of both 2.x and 3.x formats and implements reading from sstable 3.x
very simple table with just partition keys.
Tests: units (release)
"
* 'haaawk/sstables3/read_only_partitions_v4' of ssh://github.com/scylladb/seastar-dev: (22 commits)
Test for reading sstable in MC format with no columns
Use new mp_row_consumer_m and data_consume_rows_context_m
Introduce mp_row_consumer_m
Rename mp_row_consumer to mp_row_consumer_k_l
Introduce consumer_m and data_consume_rows_context_m
Use read_short_length_bytes in RANGE_TOMBSTONE
Use read_short_length_bytes in ATOM_START
Use read_short_length_bytes in ROW_START
Add continuous_data_consumer::read_short_length_bytes
Reduce duplication with continuous_data_consumer::read_partial_int
Add test for a simple table with just partition key
Add test for reading index
Extract mp_row_consumer to separate header
Make sstable_mutation_reader independent from mp_row_consumer
Make sstable_mutation_reader a template
Make data_consume_context a template
Move data_consume_rows_context from row.cc to row.hh
Decouple sstable.hh and row.hh
Reduce visibility of sstable::data_consume_*
Move data_consume_context to separate header
...
Take DataConsumeRowsContext type as parameter.
This will allow us to implement different context
for reading 3.x files.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Parametrize it with the type of data consume rows context.
There will be different implementations used for different
sstable file formats.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It will be used as a template parameter for sstable_mutation_reader
once it's turned into a template. This means the definition has
to be accessible.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
They are used just in partition.cc, row.cc and sstables_test.cc
so it is usefull to cut their scope by moving them
to data_consume_context.hh.
This will make it much easier to turn data_consume_context into
a template.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It's used only in row.cc, partition.cc and sstables_test.cc
so it's better to reduce the dependency just to those files.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
On some build environment we may want to limit number of parallel jobs since
ninja-build runs ncpus jobs by default, it may too many since g++ eats very
huge memory.
So support --jobs <njobs> just like on rpm build script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
"
This patchset brings in a statistics collector that tracks minimal
values for timestamps, TTLs and local deletion times for all the updates
made to a given memtable.
This statistics is later used when flushing memtables into SSTables
using 3.x ('mc') format to delta-encode corresponding values using
collected minimums as bases (that is why it is called encoding
statistics).
This patchset is sent out apart from other changes that introduce
writing SSTables 3.x to facilitate read path implementation that also
needs the encoding_stats structure.
The tests for write path implicitly cover this functionality as any rows
written to a SSTable 3.0 file make use of delta-encoding.
"
* 'projects/sstables-30/collect-encoding-statistics-v4' of https://github.com/argenet/scylla:
Collect encoding statistics for memtable updates.
Factor out min_tracker and max_tracker as common helpers.
Always pass mutation_partitions to partition_entry::apply()
We keep track of all updates and store the minimal values of timestamps,
TTLs and local deletion times across all the inserted data.
These values are written as a part of serialization_header for
Statistics.db and used for delta-encoding values when writing Data.db
file in SSTables 3.0 (mc) format.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
They will be re-used for collecting encoding statistics which is needed
to write SSTables 3.0.
Part of #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Previously it was also possible to pass a frozen_mutation to it.
Now we de-serialize frozen mutations at the calling side.
This is a pre-requisite for collecting memtable statistics needed for
writing into the SSTables 3.0 format.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When provisioning a Scylla docker image with --developer-mode 0 (disabled)
scylla_raid_setup is not invoked. As a consequence the "data" directory is not
created and scylla_io_setup fails (steps to reproduce and error message provided
at the end).
This patch adds the same verifications present in scylla_io_setup to docker's
scyllasetup.py and creates the data directory in the case it is not present.
--
Steps to reproduce on AWS i3.2xlarge with Ubuntu 16.04:
sudo -s
apt update && apt upgrade -y && apt-get install docker.io -y
mdadm --create --verbose --force --run /dev/md0 --level=0 -c1024 --raid-devices=1 /dev/nvme0n1
mkfs.xfs /dev/md0 -f -K
mkdir /var/lib/scylla
mount -t xfs /dev/md0 /var/lib/scylla
docker run --name some-scylla \
--volume /var/lib/scylla:/var/lib/scylla \
-p 9042:9042 -p 7000:7000 -p 7001:7001 -p 7199:7199 \
-p 9160:9160 -p 9180:9180 -p 10000:10000 \
-d scylladb/scylla --overprovisioned 1 --developer-mode 0
docker logs some-scylla
running: (['/usr/lib/scylla/scylla_dev_mode_setup', '--developer-mode', '0'],)
running: (['/usr/lib/scylla/scylla_io_setup'],)
terminate called after throwing an instance of 'std::system_error'
what(): open: No such file or directory
ERROR:root:/var/lib/scylla/data did not pass validation tests, it may not be on XFS and/or has limited disk space.
This is a non-supported setup, and performance is expected to be very bad.
For better performance, placing your data on XFS-formatted directories is required.
To override this error, enable developer mode as follow:
sudo /usr/lib/scylla/scylla_dev_mode_setup --developer-mode 1
failed!
Traceback (most recent call last):
File "/docker-entrypoint.py", line 15, in <module>
setup.io()
File "/scyllasetup.py", line 34, in io
self._run(['/usr/lib/scylla/scylla_io_setup'])
File "/scyllasetup.py", line 23, in _run
subprocess.check_call(*args, **kwargs)
File "/usr/lib64/python3.4/subprocess.py", line 558, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/lib/scylla/scylla_io_setup']' returned non-zero exit status 1
ls -latr /var/lib/scylla
total 4
drwxr-xr-x 44 root root 4096 Abr 24 13:02 ..
drwxr-xr-x 2 root root 6 Abr 24 13:10 .
Signed-off-by: Moreno Garcia <moreno@scylladb.com>
Message-Id: <20180424173729.22151-1-moreno@scylladb.com>
Fixes#3187
Requires seastar "inet_address: Add constructor and conversion function
from/to IPv4"
Implements support IPv6 for CQL inet data. The actual data stored will
now vary between 4 and 16 bytes. gms::inet_address has been augumented
to interop with seastar::inet_address, though of course actually trying
to use an Ipv6 address there or in any of its tables with throw badly.
Tests assuming ipv4 changed. Storing a ipv4_address should be
transparent, as it now "widens". However, since all ipv4 is
inet_address, but not vice versa, there is no implicit overloading on
the read paths. I.e. tests and system_keyspace (where we read ip
addresses from tables explicitly) are modified to use the proper type.
Message-Id: <20180424161817.26316-1-calle@scylladb.com>
CQL normally folds identifiers such as column names to lowercase. However,
if the column name is quoted, case-sensitive column names and other strange
characters can be used. We had a bug where such columns could be indexed,
but then, when trying to use the index in a SELECT statement, it was not
found.
The existing code remembered the index's column after converting it to CQL
format (adding quotes). But such conversion was unnecessary, and wrong,
because the rest of the code works with bare strings and does not involve
actual CQL statements. So the fix avoids this mistaken conversion.
This patch also includes a test to reproduce this problem.
Fixes#3154.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424154920.15924-1-nyh@scylladb.com>
This commit fixes two closely related issues with handling
case-sensitive column names in JSON:
* according to doc, case-sensitive names should be wrapped with
additional pair of double quotes during JSON SELECT
* logic error in parse_json() prevented INSERT JSON from working
properly on case-sensitive column names
This commit is followed by updated cql_query_test, which checks
case-sensitive cases as well.
Message-Id: <82d9d5e193a656e99bc86b297c00662a6fb808a0.1524576066.git.sarna@scylladb.com>
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.
Tests: units (release)
"
* 'haaawk/sstables3/loading_v4' of ssh://github.com/scylladb/seastar-dev:
Add test for loading the whole sstable
Add test for loading statistics
Add support for 3_x stats metadata
Pass sstable version to describe_type
Pass sstable version to write methods
metadata_type: add Serialization type
Pass sstable_version_types to parse methods
Add test for reading filter
Add test for read_summary
sstables 3.x: Add test for reading TOC
sstable: Make component_map version dependent
sstable::component_type: add operator<<
Extract sstable::component_type to separete header
Remove unused sstable::get_shared_components
sstable_version_types: add mc version
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
I forgot that I also need to update test.py for the new test.
It's unfortunate that this script doesn't pick up the list of
tests automatically (perhaps with a black-list of tests we don't
want to run). I wonder if there are additional tests we are
forgetting to run.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424085911.29732-1-nyh@scylladb.com>
Move the two tests we have for the secondary indexing feature from the
huge tests/cql_query_test.cc to a new file, secondary_index_test.cc.
Having these tests in a separate file will make it easier and faster to
write more tests for this feature, and to run these tests together.
This patch doesn't change anything in the tests' code - it's just a code
move.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424084700.28816-1-nyh@scylladb.com>
Old versions of JsonCpp declare the following typedefs for internally
used aliases:
typedef long long int Int64;
typedef unsigned long long int UInt64;
In newer versions (1.8.x), those are declared as:
typedef int64_t Int64;
typedef uint64_t UInt64;
Those base types are not identical so in cases when a type has
constructors overloaded only for specific integral types (such as
Json::Value in JsonCpp or data_value in Scylla), an attempt to
pack/unpack an integer from/to a JSON object causes ambiguous calls.
Fixes#3208
Tests: unit {release}.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e9fff9f41e0f34b15afc90b5439be03e4295623e.1524556258.git.vladimir@scylladb.com>
We were feeding the total estimation partition count of an input shared
sstable to the output unshared ones.
So sstable writer thinks, *from estimation*, that each sstable created
by resharding will have the same data amount as the shared sstable they
are being created from. That's a problem because estimation is feeded to
bloom filter creation which directly influences its size.
So if we're resharding all sstables that belong to all shards, the
disk usage taken by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.
Partition count estimation for a shard S will now be done as follow:
//
// TE, the total estimated partition count for a shard S, is defined as
// TE = Sum(i = 0...N) { Ei / Si }.
//
// where i is an input sstable that belongs to shard S,
// Ei is the estimated partition count for sstable i,
// Si is the total number of shards that own sstable i.
Fixes#2672.
Refs #3302.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
"
Fixes to several issues around view update generation, pertaining to
timestamp and TTL management.
Fixes#3361Fixes#3360Fixes#3140
Refs #3362
Tests: unit(release, debug), dtest(materialized_views.py)
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/fixes-galore/v2' of http://github.com/duarten/scylla:
mutation_partition: Clarify comment about emptiness
tests: Add view_complex_test
tests/view_schema_test: Complete test
db/view: Move cells instead of copying in add_cells_to_view()
db/view: Handle unselected base columns and corner cases
mutation_partition: Regular base column in view determines row liveness
db/view: Don't avoid read-before-write when view PK matches base
db/view: Process base updates to column unselected by its views
db/view: Consider partition tombstone when generating updates
tests/view_schema_test: Remove unneeded test
mutation_fragment: Allow querying if row is live
view_info: Add view_column() overload
view_info: Explicitly initialize base-dependent fields
cql3/alter_table_statement: Forbid dropping columns of MV base tables
This patch fixes several cases where it was disallowed to create
a materialized view with a filter ("where ..."), for no good reason.
After this patch, these cases will be allowed. Fixes#2367.
In ordinary SELECT queries, certain types of filtering which is known to
be deceptively inefficient is now allowed. For example, trying to query
a range of partition keys cannot be done without reading the entire
database (because the murmur3 tokenizer randomizes the order of partitions).
Restricting two partition key components also cannot be done without
reading excessive amount of the entire partition. So Scylla, following
Cassandra, chooses to disallow such SELECT queries, and give an error
message.
However, the same SELECT statements *should* be allowed when defining a
materialized view. In this case, the filter is just used to check an
individual row - not to search for one - so there is no performance
concern.
Unfortunately the existing code did these validations while building the
SELECT statement's "restrictions", in code shared by both uses of SELECT
(query and MV definition). It was easy to move one of the validations
to later code which runs after the restriction has already been built (and
knows if it is working for query or MV), but because of the way the
"restrictions" objects (translated from Cassandra 2's code) hide what they
contain, many of the checks are harder to perform after having built the
restrictions object. So instead, we add in strategic places in the
restriction-handling code a new "allow_filtering" flag. If restrictions
are built with allow_filtering=true, the extra performance-oriented tests
on the filtering restrictions is not done. Materialized views sets
allow_filtering=true.
The allow_filtering flag will also be useful later when we want to support
the "ALLOW FILTERING" query option which is currently not supported properly
(we have several open issues on that). However note that this patch doesn't
complete that support: I left a FIXME in the spot where we set
allow_filtering in the Materialized Views case, but in the futre also need
to set it if the user specified "ALLOWED FILTERING" in the query.
This patch also enables several unit tests written by Duarte which used to
fail because of this bug, and now pass. These tests verify that the
restrictions are now allowed and filter the view as desired; But I also
added test code to verify that the same restrictions are still forbidden,
as before, when used in ordinary SELECT queries.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180423124343.17591-1-nyh@scylladb.com>
"
Make sure install_dependencies.sh installs all the right dependencies
and that the example `configure.py` invokation can just be copy-pasted
into the terminal and will "just work".
Ref: #3208
"
* 'fix_centos_compile/v2' of https://github.com/denesb/scylla:
install_dependencies.sh: update centos package list and example
configure.py: add --with-ragel option
configure.py: add --with-antlr3
configure.py: check compiler version first
Add missing packages to `yum install` list:
* scylla-boost163-static
* scylla-python34-pyparsing20
Update the configure.py example so that it just works:
* Change g++ to 7.3
* Add --with-antlr3 pointing to antlr3 installed from scylla 3rdparty
Before checking anything else (presence of boost, its version, etc.)
check that the compiler is present and can compile and link a simple c++
program.
Before if the compiler was not set up correctly configure.py would fail
at one of the other try_compile checks, whichever came first (usually
the one checking for boost). This lead the user into chasing some
false-positive error when in fact the compiler wasn't working.
Debian 8 causes "Invalid argument" when we used AmbientCapabilities on systemd
unit file, so drop the line when we build .deb package for Debian 8.
For other distributions, keep using the feature.
Fixes#3344
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180423102041.2138-1-syuu@scylladb.com>
"
This series complements JSON support with INSERT JSON and fromJson
cql function.
INSERT JSON implementation tries hard to interfere as little as possible
with regular INSERT path. So, after being parsed, insertJsonStatement
exists as a separate statement and is handled in a special way.
Overridden add_update_for_key extracts values from JSON map and applies
them to columns.
Converting from insert_json_statement to insert_statement uses auxiliary
from_json_object methods to convert JSON-encoded types to bytes.
Then, terms are matched to appropriate column names and cells are
updated.
fromJson CQL function uses the same from_json_object helper methods,
but applies them to single arguments, not whole rows.
Existing json handling functions from json.hh and libjsoncpp were used
where possible.
Things implemented:
* expanding CQL grammar to accept INSERT JSON
* converting JSON representation of cql values to cql terms
* serving 'INSERT INTO xxx JSON yyy' clause
* tests for INSERT JSON and fromJson()
"
* 'json_ops_2' of https://github.com/psarna/scylla:
tests: add cql unit tests for INSERT JSON
cql3: add fromJson() function
cql3: add INSERT JSON parsing to CQL grammar
cql3: add support for INSERT JSON clause
cql3: decouple execute from term binding in setters
cql3: change operation::make_* functions to static
cql3: add from_json_object function to types
cql3: Make literals::NULL_VALUE public
This commit adds tests for INSERT JSON clause, which is expected
to accept JSON strings and insert appropriate values to columns
defined there.
The tests also cover fromJson function calls and inserting prepared
batch statements with INSERT JSON inside.
References #2058
This function extends JSON support with fromJson() function,
which can be used in UPDATE clause to transform JSON value
into a value with proper CQL type.
fromJson() accepts strings and may return any type, so its instances,
like toJson(), are generated during calls.
This commit also extends functions::get() with additional
'receiver' parameter. This parameter is used to extract receiver type
information neeeded to generate proper fromJson instance.
Receiver is known only during insert/update, so functions::get() also
accepts a nullptr if receiver is not known (e.g. during selection).
References #2058
This commit adds the implementation of INSERT JSON clause
which accepts JSON object as parameter and inserts appropriate
values into appropriate columns, as defined in given JSON.
Example:
INSERT INTO testme JSON '{
"id" : 77,
"name" : "Jones",
"ranking" : 8.5
}'
References #2058
This commit makes it possible to pass values to setters,
instead of having to pass cql3::term instances.
Thanks to that previously prepared terminals can be directly
used in a setter execution.
References #2058
This commit makes operation::make* functions static, because they
don't access any instance-specific data anyway. It is later needed
to decouple setter execution from binding a cql3::term.
This commit adds a 'from_json_object' method which will be used
for converting JSON representation of a value to raw bytes representing
the same value. This functionality will be needed by 'INSERT JSON'
clause implementation, which can turn these raw bytes into cql3::term.
References #2058
continuous_data_consumer_test takes an unreasonable amount of
time to run, especially in debug mode. Reduce the run time by
reducing the number of loops.
Message-Id: <20180422150938.29143-1-avi@scylladb.com>
This patch introduces view_complex_test and adds more test coverage
for materialized views.
A new file was introduced to avoid making view_schema_test slower.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.
This patch ensures that unselected columns are considered as much as
possible, even though some limitations will still exist. In
particular, we need to represent multiple timestamps (from all the
unselected columns), but have only mechanisms to record a single
timestamp.
We also have some issues when dealing with selected column, and the
way we currently delete them. Consider the following:
create table cf (p int, c int, a int, b int, primary key (p, c))
create materialized view vcf as select a, b
from cf where p is not null and c is not null
primary key (p, c)
1) update cf using timestamp 10 set a = 1 where p = 1 and c = 1
2) delete a from cf using timestamp 11 where p = 1 and c = 1
3) update cf using timestamp 1 set a = 2 where p = 1 and c = 1
After 1), the MV should include a row with row marker @ ts10,
p = 1, c = 1, a = 1. After 2), this row should be removed.
At 3), we should add a row with row marker @ ts1, p = 1, c = 1, a = 1,
with a lower timestamp. This means that the delete should not
insert a row tombstone with timestamp @ 11, as we do now but it should
just delete the view's row marker (which exists) with ts1.
Refs #3362Fixes#3140Fixes#3361
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When views contain a primary key column that is not part of the base
table primary key, that column determines whether the row is live or
not. We need to ensure that when that cell is dead, and thus the
derived row marker, either by normal deletion of by TTL, so is the
rest of the row.
This patch introduces the idea of shawdowing row marker. We map the
status of the regular base column in the view's PK to the view row's
marker. If this marker is dead, so is that cell in the base table, and
so should the view row become. To enforce that, a view row's dead
marker shadows the whole row if that view includes a base regular
column in its PK.
Fixes#3360
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. When calculating the view's row marker we need
to access those unselected columns, so we can't avoid the
read-before-write as we were doing.
Refs #3362
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. So, process base updates to columns unselected by
any of its views.
Refs #3362
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Not adding the partition tombstone to the current list of tombstones
may cause updates to be incorrectly generated.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of lazily-initializing the regular base column in the view's
PK field, explicitly initialize it. This will be used by future
patches that don't have access to the schema when wanting to obtain
that column.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.
The fact that unselected columns can keep a view row alive also
requires that users cannot drop columns of base tables with
materialized views, which this patch implements.
Refs #3362
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.
Tests: units (release)
"
* 'haaawk/sstables3/loading_v3' of ssh://github.com/scylladb/seastar-dev:
Add test for loading the whole sstable
Add test for loading statistics
Add support for 3_x stats metadata
Pass sstable version to describe_type
Pass sstable version to write methods
metadata_type: add Serialization type
Pass sstable_version_types to parse methods
Add test for reading filter
Add test for read_summary
sstables 3.x: Add test for reading TOC
sstable: Make component_map version dependent
sstable::component_type: add operator<<
Extract sstable::component_type to separete header
Remove unused sstable::get_shared_components
sstable_version_types: add mc version
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The compression_parameter constructor is called with an extra level of
parentheses. Presumably this caused a temporary object to be constructed
and then moved into the argument being initialized, but gcc 8 complains
about ambiguity.
Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121854.12314-1-avi@scylladb.com>
The token constructor is called with an extra level of parentheses. Presumably
this caused a temporary object to be constructed and then moved into the
variable being initialized, but gcc 8 complains about ambiguity.
Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121736.12136-1-avi@scylladb.com>
The parameters to the MutationFragmentConsumer concept must be concrete
types, not decltype(auto).
Reported by gcc 8.
Message-Id: <20180421110738.7574-1-avi@scylladb.com>
"
Prints info about sstables used by readers
Example:
(gdb) scylla active-sstables
sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
sstable_count=2, total_index_lists_size=0
"
* 'tgrabiec/gdb-scylla-active-sstables' of github.com:tgrabiec/scylla:
gdb: Introduce "scylla active-sstables" command
gdb: Make list_unordered_map() more general
gdb: Improve compatibility with python2.7
Prints info about sstables used by readers
Example:
(gdb) scylla active-sstables
sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
sstable_count=2, total_index_lists_size=0
1) vt.name returns None for some types, use str() instead
2) some unorderd_maps use 'false' as the second Hash_node template parameter
3) some consumers will prefer a reference to the value instead of its address
"
Enhance continuous_data_consumer to use existing vint serialization for reading
variant integers from SSTables.
Also available at:
https://github.com/scylladb/seastar-dev/commits/haaawk/sstables3/unsigned-vint-v6
Tests: units (release)
"
* 'haaawk/sstables3/unsigned-vint-v6' of ssh://github.com/scylladb/seastar-dev:
sstables: add test for continuous_data_consumer::read_unsigned_vint
buffer_input_stream: make it possible to specify chunk size
Add tests for make_limiting_data_source
Introduce make_limiting_data_source
sstables: add continuous_data_consumer::read_unsigned_vint
Cover serialized_size_from_first_byte in tests
core: add unsigned_vint::serialized_size_from_first_byte
sstables: add all dependant headers to consumer.hh
sstables: add all dependant headers to exceptions.hh
core: add #pragma once to vint-serialization.hh
When 'always_set_home' is specified on /etc/sudoers pbuilder won't read
.pbuilderrc from current user home directory, and we don't have a way to change
the behavor from sudo command parameter.
So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo executed,
this can work both environment which does specified always_set_home and doesn't
specified.
Fixes#3366
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
This method takes a data_source and returns another data_source
that returns data from the input source but in chunks of limited
size.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This method takes first byte and determins how many bytes
are used to represent an unsigned variant integer.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Before it was depending on byteorder.hh that just happend
to be included in all compilation units that were using consumer.hh
This change makes the header compile when used in new compilation units.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Before it was depending on print.hh that just happend
to be included in all compilation units that were using
exceptions.hh. This change makes the header compile
when used in new compilation units.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Since storage_proxy provides access to the entire cluster, a local shard
reference is sufficient. Adjust query_processor to store a reference to
just the local shard, rather than a seastar::sharded<storage_proxy> and
adjust callers.
This simplifies the code a little.
Message-Id: <20180415142656.25370-3-avi@scylladb.com>
The storage_proxy represents the entire cluster, so there's never a need
to access it on a remote shard; the local shard instance will contact
remote shard or remote nodes as needed.
Simplify the API by passing storage_proxy references instead of
seastar::sharded<storage_proxy> references. query_processor and
other callers are adjusted to call seastar::sharded::local() first.
Message-Id: <20180415142656.25370-2-avi@scylladb.com>
build_deb.sh relies on pbuilder picking up a ~/.pbuilderrc which we
copy from the script. According to the pbuilder manual, "~" will refer
to the root directory (since pbuilder is run via sudo). In practice
we've observed this working with "~" referring to the current user's
home directory, but also sometimes failing, while complaining
about /root/.pbuilderrc failing. When it fails, it fails to set
the correct distribution.
To be extra sure, also copy .pbuilderrc to root's home directory. This
way, whatever behavior pbuilder chooses to follow, it will have a
configuration file to read.
Message-Id: <20180410134508.9415-1-avi@scylladb.com>
The patch fixes a bug introduce by commit
089b54f2d2.
When sstable files are stored in .../upload directory
and refresh is initialised with `nodetool` then it fails
because Scyla doesn't expect .../upload to be a part of the path.
Fixes#3334.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180413132019.17779-1-daniel@scylladb.com>
This commit extends JSON support with toJson() function,
which can be used in SELECT clause to transform a single argument
to JSON form.
toJson() accepts any type including nested collection types,
so instead of being declared with concrete types,
proper toJson() instances are generated during calls.
This commit also supplements JSON CQL query tests with toJson calls.
Finally, it refactors JSON tests so they use do_with_cql_env_thread.
References #2058
Message-Id: <a7833650428e9ef590765a14e91c4d42532588f4.1523528698.git.sarna@scylladb.com>
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete stale pointer to the connection will remain in notification
list since attempt to unregister the connection will happen to early.
The fix is to move notifier unregisteration after connection's gate
is closed which will ensure that there is no outstanding registration
request. But this means that now a connection with closed gate can be in
notifier list, so with_gate() may throw and abort a notifier loop. Fix
that by replacing with_gate() by call to is_closed();
Fixes: #3355
Tests: unit(release)
Message-Id: <20180412134744.GB22593@scylladb.com>
That's blocking KairosDB users because it uses TWCS with millisecond
timestamp resolution.
Also older drivers use millisecond instead of the default microsecond.
Fixes#3152.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180411171244.19958-1-raphaelsc@scylladb.com>
In my well intentioned attempt to use fewer magic numbers in the loading
code I replaced "64" with something calculated automatically from the
type being used.
Except I did it wrong, because sizeof(uint64_t) is 8, not 64.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411155903.27665-1-glauber@scylladb.com>
"
This series introduces 'SELECT JSON' clause support for CQL.
Things implemented:
* expanding CQL grammar with JSON keyword
* converting values to JSON format
* serving 'SELECT JSON *' clauses
* tests for 'SELECT JSON'
"
* 'json_ops' of https://github.com/psarna/scylla:
tests: add cql unit tests for SELECT JSON
cql3: Add JSON token to CQL grammar
cql3: add support for SELECT JSON clause
cql3: add to_json_string function to types
This commit adds JSON keyword to CQL grammar and allows parsing
'SELECT JSON' command in CQL. Additionally, it will be useful
in implementing 'INSERT JSON(...)'.
References #2058
This commit adds the implementation of SELECT JSON clause
which returns rows in JSON format. Each returned row has a single
'[json]' column.
References #2058
"
The multishard combined reader provides a convenient
flat_mutation_reader implementation that takes care of efficiently
reading a range from all shards that own data belonging to the range.
All this happens transparently, the user of the reader need only pass a
factory function to the multishard reader which it uses to create
remote readers when needed. These remote readers will then be managed
through foreign reader which abstracts away the fact that the reader is
located on a remote shard.
Sub readers are created for the entire read range, meaning they are free
to cross shard-range limits to fill their buffer. The output of these
sub readers is merged in a round-robin manner, the same way data is
distributed among shards. The multishard reader will move to the next
shard's reader whenever it encounters a partition whose token is after
the delimiter token.
To improve throughput and latency two levels of read-ahead is employed.
One in foreign_reader, which will try to fill the remote shard reader's
buffer in the background, in parallel to processing the results on the
local shard. And one in the multishard reader itself which will
exponentially increase concurrency whenever a sub-reader's buffer
becomes empty. But only if this happened after crossing a shard
boundary. This is important because there is no point in increasing
concurrency if a single sub reader can fill the multishard readers'
buffer.
"
* 'multishard-reader/v3' of https://github.com/denesb/scylla:
Add unit tests for multishard_combined_reader
Add multishard_combined_reader
flat_mutation_reader: add peek_buffer()
Add unit tests for foreign_reader
forwardable reader: implement fast_forward_to(position_in_partition)
Add foreign_reader
flat_mutation_reader: add detach_buffer()
Asias reported in issue #3351 that a floating point exception was seen
while loading SSTables. Looking at the trace, that seems to be because
we tried to issue a modulo operation with something that was likely 0.
That field comes from the nr_bits attribute in the large bitset, and our
current code should set it to whatever we read from the Filter file -
something that has been working for ages.
The difference is that after the patch that Asias identified as culprit,
we are moving the array from which we compute the size in the same
parameter list where we are computing the size.
This works for me and passed all my tests - likely because my compiler
was doing left-to-right evaluation as I would expect it to do. But the
standard doesn't guarantee that at all, and it reads:
"Order of evaluation of the operands of almost all C++ operators
(including the order of evaluation of function arguments in a
function-call expression and the order of evaluation of the
subexpressions within any expression) is unspecified. The compiler can
evaluate operands in any order, and may choose another order when the
same expression is evaluated again."
This likely fixes the bug, but even if it doesn't we should patch it,
since we currently have something that is technically an UB.
Fixes#3351.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411144036.24748-1-glauber@scylladb.com>
The patch fixes a bug introduce by commit 089b54f2d2.
This bug exhibited when master was deployed in an attempt to populate
materialised views. The nodes restarted in the middle and they were not able
to come back.
The fix is to remember formats and versions of sstables for every generation.
Fixes: #3324.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180410083114.17315-1-daniel@scylladb.com>
This commit adds a 'to_json_string' method which will be used
for converting values to JSON strings. In several cases it's not
sufficient to use 'to_string', e.g. actual strings need to be
surrounded with double quotes.
References #2058
Some tests escaped the --overprovisioned flag, causing them to
compete over cpu 0. Add the flag to all tests.
Message-Id: <20180410181606.8341-1-avi@scylladb.com>
Takes care of reading a range from all shards that own a subrange in the
range. The read happens sequentially, reading from one shard at a time.
Under the scenes it uses combined_mutation_reader and foreign_reader,
the former providing the merging logic and the latter taking care of
transferring the output of the remote readers to the local shard.
Readers are created on-demand by a reader-selector implementation that
creates readers for yet unvisited shards as the read progresses.
The read starts with a concurrency of one, that is the reader reads from
a single shard at a time. The concurrency is exponentially increased (to
a maximum of the number of shards) when a reader's buffer is empty after
moving the next shard. This condition is important as we only wan't to
increase concurrency for sparse tables that have little data and the
reader has to move between shards often. When concurrency is > 1, the
reader issues background read-aheads to the next shards so that by the
time it needs to move to them they have the data ready.
For dense tables (where we rarely cross shards) we rely on the
foreign_reader to issue sufficient read-aheads on its own to avoid
blocking.
Allows peeking at the next mutation fragment in the buffer. As opposed
to the existing `peek()` it assumes there's at least one fragment in the
buffer. Useful for code that already ensured that the buffer is not
empty and doesn't want to introduce a continuation (via `peek()`).
Instead of throwing std::bad_function_call. Needed by the foreign_reader
unit test. Not sure how other tests didn't hit this before as the test
is using `run_mutation_source_tests()`.
Local representant of a reader located on a remote shard. Manages the
lifecycle and takes care of seamlessly transferring fragments produced
by the remote reader. Fragments are *copied* between the shards in
batches, a bufferful at a time.
To maximize throughput read-ahead is used. After each fill_buffer() or
fast_forward_to() a read-ahead (a fill_buffer() on the remote reader) is
issued. This read-ahead runs in the background and is brough back to
foreground on the next fill_buffer() or fast_forward_to() call.
Allows for detaching the internal buffer of the reader. Enables
convenient transferring of buffered fragmends in a single batch but
will force the reader to reallocate it's buffer on the next
fill_buffer() call.
Introduced for foreign_reader which favours quick transferring of the
fragments between shards in a single batch, over minimizing allocations,
which can be amortized by background read-aheads.
After change to serialize compaction on compaction weight (eff62bc61e),
LCS invariant may break because parallel compaction can start, and it's
not currently supported for LCS.
The condition is that weight is deregistered right before last sstable
for a leveled compaction is sealed, so it may happen that a new compaction
starts for the same column family meanwhile that will promote a sstable to
an overlapping token range.
That leads to strategy restoring invariant when it finds the overlapping,
and that means wasted resources.
The fix is about removing a fast path check which is incorrect now because
we release weight early and also fixing a check for ongoing compaction
which prevented compaction from starting for LCS whenever weight tracker
was not empty.
Fixes#3279.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.
Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
Problem:
Start node 1 2 3
Shutdown node2
Shutdown node1 node3
Start node1 node3
Try to repalce_address for node 2
The replace operation fails with the error:
seastar - Exiting on unhandled exception: std::runtime_error
(Cannot replace_address node2 because it doesn't exist in gossip)
This is because after all nodes shutdown, the other nodes do not have the
tokens and host_id info of node2 until node2 boots up and talks to the cluster.
If node2 can not boots up for whatever reason, currently the only way to
recover node2 is to `nodetool removenode` and bootstrap node2 again. This will
change tokens in the cluster and cause more data movement than just replacing
node2.
To fix, we add the tokens and host_id gossip application state in add_saved_endpoint
during boot up.
This is pretty safe because the generation for application state added by
add_saved_endpoint is zero, if node2 actually boots, other nodes will update
with node2's version.
Before:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"generation": 0,
"is_alive": false,
"update_time": 1523344828953,
"version": 0
}
Node 2 can not be replaced.
After:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"application_state": [
{
"application_state": 12,
"value": "31284090-2557-4036-9367-7bb4ef49c35a",
"version": 2
},
{
"application_state": 13,
"value": "... a lot of tokens ...",
"version": 1
}
],
"generation": 0,
"is_alive": false,
"update_time": 1523344828953,
"version": 0
}
Node 2 can be replaced.
Tests: dtest/replace_address_test.py
Fixes: #3347
Message-Id: <117fd6649939e0505847335791be8d7a96e7d273.1523346805.git.asias@scylladb.com>
save and load functions for the large_bitset were introduced by Avi with
d590e327c0.
In that commit, Avi says:
"... providing iterator-based load() and save() methods. The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks."
The only user of this interface is SSTables. And turns out we don't really
split the access like that. What we do instead is to create a chunked vector
and then pass its begin() method with position = 0 and let it write everything.
The problem here is that this require the chunked vector to be fully
initialized, not just reserved. If the bitmap is large enough that in itself
can take a long time without yielding (up to 16ms seen in my setup).
We can simplify things considerably by moving the large_bitset to use a
chunked vector internally: it already uses a poor man's version of it
by allocating chunks internally (it predates the chunked_vector).
By doing that, we can turn save() into a simple copy operation, and do
away with load altogether by adding a new constructor that will just
copy an existing chunked_vector.
Fixes#3341
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180409234726.28219-1-glauber@scylladb.com>
"This patchset removes zones and replaces them with a simpler system. LSA tries
to allocate segments at higher addresses, so that we'll end up with the standard
allocator using lower addresses and LSA using higher addresses, allowing for easier
allocation from std."
* tag 'lsa-no-zones/v6' of https://github.com/avikivity/scylla:
tests: add logalloc_test for large contiguous allocations in a challenging environemnt
logalloc: limit std segment allocations in debug mode
logalloc: introduce prime_segment_pool()
logalloc: limit non-contiguous reclaims
logalloc: pre-allocate all memory as lsa on startup
tests: add random test for dynamic_bitset
dynamic_bitset: optimize for large sets
dynamic_bitset: get rid of resize()
dynamic_bitset: remove find_*_clear() variants
logalloc: reduce segment size to 128k
logalloc: get rid of the emergency reserve stack
logalloc: replace zones with segment-at-a-time alloc/free
Commit 1671d9c433 (not on any release branch)
accidentally bumped the idle memtable flush cpu shares to 100 (representing
10%), causing flushes to be too when they don't comsume too much cpu.
Fixes#3243.
Message-Id: <20180408104601.9607-1-avi@scylladb.com>
* seastar 33d8f74...2da7d46 (4):
> http routes: Add parameters to path when adding alias
> future: compile-time optimize futurize<void>::apply()
> memory: remove unneeded union 'pla'
> queue: not_empty()/not_full() should throw when called after abort
This state does not read any data and is used only to perform
action when finishing to read a primitive type.
According to comment on continuous_data_consumer::non_consuming
such states should be marked as non_consuming.
Tests: units (release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <55a5c9b76268b50312ecd044291f28dcd8179a22.1523005293.git.piotr@scylladb.com>
Test large std allocations in an evironement that has seen many persistent
std allocations interspersed with lsa allocations, causing memory fragmentation.
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.
Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.
However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.
Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section. Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented. This results in the original allocation continuing to fail.
To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim. The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.
After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.
Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.
Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
They are no longer used, and cannot be efficiently implemenented
for large bitsets using a summary vector approach without slowing
down the find_*_set() variants, which are used.
Also remove find_previous_set() for the same reason.
Reducing the segment size reduces the time needed to compact segments,
and increases the number of segments that can be compacted (and so
the probability of finding low-occupancy segments).
128k is the size of I/O buffers and of thread stacks, so we can't
go lower than that without more significant changes.
This patch replaces the zones mechanism with something simpler: a
single segment is moved from the standard allocator to lsa and vice
versa, at a time. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.
Moving just once segment at a time reduces the latency costs of
transferring memory between free and std.
"Together with the already merged patch, we reduce the object file
from 114MB to 81MB."
* tag 'api-diet-1/v1' of https://github.com/avikivity/scylla:
api: type-erase all-column_family map_reduce variant
api: simplify 6-argument map_reduce_cf() variant
* seastar 7328d17...33d8f74 (3):
> memory: switch to buddy allocation
> tls: Ensure we always pass through semaphores on shutdown
> memory: replace placement-new in unions with member construction
See scylladb/seastar#426.
After f59f423f3c, sstable is loaded only at shards
that own it so as to reduce the sstable load overhead.
The problem is that a sstable may no longer be forwarded to a shard that needs to
be aware of its existence which would result in that sstable generation being
reallocated for a write request.
That would result in a failure as follow:
"SSTable write failed due to existence of TOC file for generation..."
This can be fixed by forwarding any sstable at load to all its owner shards
*and* the shard responsible for its generation, which is determined as follow:
s = generation % smp::count
Fixes#3273.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>
"
Fixes to the view building process, discovered from field experience.
Tests: dtest(materialized_view_tests.py, smp=2)
"
* 'views/view-build-fixes/v1' of https://github.com/duarten/scylla:
db/view: Start view building after schema agreement
db/system_keyspace: scylla_views_builds_in_progress writes are user mem
db/view: Require configuration option to enable view building
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is: a base table regular column, a
base table clustering key column, or a base table partition key column,
part of a composite key.
Fixes#3262
Refs CASSANDRA-14345
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
If a base table or view has been dropped in one node, but another
one hasn't yet learned about it, it starts the view build process
immediately on boot, possibly calculating unneeded view updates and
causing errors at the view replica, if that replica has already
processed the schema changes. We should thus wait for schema
agreement, even if the node is a seed.
Fixes#3328
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Treat writes to scylla_views_builds_in_progress as user memory, as the
number of writes is dependent on the amount of user data on views
(times the number of views, divided by the view building batch size).
Fixes#3325
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View building, enabled by default, can contain or expose issues that
prevent the node from starting. In those cases, it is necessary to
disable view building such that the node can be submitted to
maintenance operations.
Fixes#3329
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The 6-argument map_reduce_cf function is identical to the 5-argument
version, except that it applies performs an extra cast (by calling
the 6th argument's operator=()).
Simplify the code by calling the 5-argument version from the 6-argument
version.
Reduces binary size by ~10%.
map_reduce_cf() is called with varying template parameters which each
have to be compiled separately. Unifying the internals to use types based
on std::any reduced the object size by 15% (115MB->99MB) with presumably
a commensurate decrease in compile time.
A version that used "I" instead of "std::any" (and thus merged the
internals only for callers that used the same result type) delivered
a 10% decrease in object size. While std::any is less safe, in this
case it is completely encapsulated.
Message-Id: <20180402213732.432-1-avi@scylladb.com>
test_large_allocation attempts to allocate almost half of memory.
With a buddy allocator, even if more than half of memory is free,
and even if it is contiguous, it is unlikely to be available as a
single allocation because the allocator inserts boundaries at powers-
of-two addresses.
Relax the test by allocating smaller chunks (but still the same amount,
and still with challenging sizes); allocating half of memory contiguously
is not a goal.
Also use a vector instead of a deque, and reserve it, so we don't get
intervening non-lsa allocations. I'm not sure there's a problem there
but let's not depend on the allocation patterns.
Message-Id: <20180401150828.13921-1-avi@scylladb.com>
While building with -O1, I saw that the linker could not find
the vtable for named_value<log_level>. Rather than fixing up the
includes (and likely lengthening build time), fix by defining
the class as an extern template, preventing it from being
instantiated at the call site.
Message-Id: <20180401150235.13451-1-avi@scylladb.com>
-O1 complains that client_state::_remote_addr is not initialized
(and it is right). The call site is tracing, which likely won't be
invoked for internal queries, but still.
Message-Id: <20180401150410.13651-1-avi@scylladb.com>
Since bytes is used to encapsulate blobs, not strings, there's no
need for a NUL terminator. It will never be passed to a function
that expects a C string.
Message-Id: <20180401151009.14108-1-avi@scylladb.com>
* seastar a66cc34...7328d17 (5):
> sstring: add support for non-nul-terminated sstrings
> core/sharded: Make async_sharded_service dtor virtual
> reactor: pass naked pointer to submit_io
> Merge http: "Add alias support to the API" from Amnon
> systemwide_memory_barrier: use madvise(MADV_DONTNEED) instead of mprotect()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
By default, overprovisioned is not enabled on docker unless it is
explicitly set. I have come to believe that this is a mistake.
If the user is running alone in the machine, and there are no other
processes pinned anywhere - including interrupts - not running
overprovisioned is the best choice.
But everywhere else, it is not: even if a user runs 2 docker containers
in the same machine and statically partitions CPUs with --smp (but
without cpuset) the docker containers will pin themselves to the same
sets of CPU, as they are totally unaware of each other.
It is also very common, specially in some virtualized environments, for
interrupts not to be properly distributed - being particularly keen on
being delivered on CPU0, a CPU which Scylla will pin by default.
Lastly, environments like Kubernetes simply don't support pinning at the
moment.
This patch enables the overprovisioned flag if it is explicitly set -
like we did before - but also by default unless --cpuset is set.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180331142131.842-1-glauber@scylladb.com>
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be pased over
YAML, for Cassandra compatibility.
That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.
While this problem is common to all Unused options, the likelihood for
any other unused option to appear in the command line is near zero,
except for those two - since we put them there ourselves.
There are two ways to handle this issue:
1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.
The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discrete message saying that those options are,
for now on ignored.
v2: mark set as const (Botond)
v3: rebase on top of master, identation suggested by Duarte.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
This reverts commit 3b53f922a3. It's broken
in two ways:
1. concrete_allocating_function::allocate()'ss caller,
region_group::start_releaser() loop, will delete the object
as soon as it returns; however we scheduled some work depending
on `this` in a separate continuation (via with_scheduling_group())
2. the calling loop's termination condition depends on the work being
done immediately, not later.
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node
I saw node2 could not bootstrap stuck at waiting for schema information to compelte for ever:
On node1, node3
[shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704
On node2
[shard 0] storage_service - JOINING: waiting for schema information to complete
This is becasue in nodetool removenode operation, the generation of node1 was increased from 0 to 2.
gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
Each force_newer_generation_unsafe increases the generation by 1.
Here is an example,
Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"generation": 0,
"is_alive": false,
"update_time": 1521778757334,
"version": 0
},
```
After nodetool revmoenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"application_state": [
{
"application_state": 0,
"value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
"version": 214
},
{
"application_state": 6,
"value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
"version": 153
}
],
"generation": 2,
"is_alive": false,
"update_time": 1521779276246,
"version": 0
},
```
In gossiper::apply_state_locally, we have this check:
```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
// assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",ep, local_generation, remote_generation);
}
```
to skip the gossip update.
To fix, we relax generation max difference check to allow the generation
of a removed node.
After this patch, the removed node bootstraps successfully.
Tests: dtest:update_cluster_layout_tests.py
Fixes#3331
Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
Just saw this today during a crash when creating Materialized Views.
It is still unclear why this happened. But the message says:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: scylla: sstables/sstables.cc:2973: sstables::sstable::remove_sstable_with_temp_toc(seastar::sstring, seastar::sstring, seastar::sstring, int64_t, sstables::sstable::version_types, sstables::sstable::format_types)::<lambda()>: Assertion `tmptoc == true' failed.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Aborting on shard 0.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Backtrace:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4b4c
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4df5
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4ea3
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libpthread.so.0+0x000000000000f0ff
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x00000000000355f6
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x0000000000036ce7
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e565
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e611
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000015969d0
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x0000000001596f7a
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x000000000051ca8d
I can't even guess which table caused the problem, let alone which SSTable.
That's because those asserts are the very first thing we do. We can discuss
whether or not assert is the right behaviour (usually we can't guarantee the
state is sane if that is missing, so I don't see a problem)
But it would be nice to see which SSTable we are processing before we assert.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180328160856.10717-1-glauber@scylladb.com>
"
The configuration API is part of scylla v2 configuration.
It uses the new definition capabilities of the API to dynamically create
the swagger definition for the configuration.
This mean that the swagger will contain an entry with description and
type for each of the config value.
To get the v2 of the swager file:
http://localhost:10000/v2
If using with swagger ui, change http://localhost:10000/api-doc to http://localhost:10000/v2
It takes longer to load because the file is much bigger now.
"
* 'amnon/config_api_v5' of github.com:scylladb/seastar-dev:
Explanation about the API V2
API: add the config API as part of the v2 API.
Defining the config api
The config API is created dynamically from the config. This mean that
the swagger definition file will contain the description and types based on the
configuration.
The config.json file is used by the code generator to define a path that is
used to register the handler function.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"
This series introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.
The view_builder uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.
We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the view building process.
We employ a flat_mutation_reader for each base table for which we're
building views.
We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.
We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.
Interaction with the system tables:
- When we start building a view, we add an entry to the
scylla_views_builds_in_progress system table. If the node restarts
at this point, we'll consider these newly inserted views as having
made no progress, and we'll treat them as new views;
- When we finish a build step, we update the progress of the views
that we built during this step by writing the next token to the
scylla_views_builds_in_progress table. If the node restarts here,
we'll start building the views at the token in the next_token
column.
- When we finish building a view, we mark it as completed in the
built views system table, and remove it from the in-progress system
table. Under failure, the following can happen:
* When we fail to mark the view as built, we'll redo the last
step upon node reboot;
* When we fail to delete the in-progress record, upon reboot
we'll remove this record.
A view is marked as completed only when all shards have finished
their share of the work, that is, if a view is not built, then all
shards will still have an entry in the in-progress system table;
- A view that a shard finished building, but not all other shards,
remains in the in-progress system table, with first_token ==
next_token.
Interaction with the distributed system tables:
- When we start building a view, we mark the view build as being
in-progress;
- When we finish building a view, we mark the view as being built.
Upon failure, we ensure that if the view is in the in-progress
system table, then it may not have been written to this table. We
don't load the built views from this table when starting. When
starting, the following happens:
* If the view is in the system.built_views table and not the
in-progress system table, then it will be in this one;
* If the view is in the system.built_views table and not in
this one, it will still be in the in-progress system table -
we detect this and mark it as built in this table too,
keeping the invariant;
* If the view is in this table but not in system.built_views,
then it will also be in the in-progress system table - we
don't detect this and will redo the missing step, for
simplicity.
View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.
When building view updates, we consider that everything is new and
nothing pre-existing is there (which means no tombstones will be sent
out to the paired view replicas).
Tests:
unit (debug)
dtest (materialized_view_test.py(smp=1, smp=2))
"
* 'view-building/v4' of https://github.com/duarten/scylla: (22 commits)
tests/view_build_test: Add tests for view building
tests/cql_test_env: Move eventually() to this file
tests/cql_assertions: Assert result set is not empty
tests/cql_test_env: Start the view_builder
db/view/view_builder: Allow synchronizing with the end of a build
db/view/view_builder: Actually build views
flat_mutation_reader: Make reader from mutation fragments
db/view/view_builder: React to schema changes
service/migration_listener: Add class for view notifications
db/view: Introduce view_builder
column_family: Add function to populate views
column_family: Allow synchronizing with in-progress writes
database: Compare view id instead of name in find_views()
database: Add get_views() function
db/view: Return a future when sending view updates
service/storage_service: Allow querying the view build status
db: Introduce system_distributed_keyspace
tests: Add unit test for build_progress_virtual_reader
db/system_keyspace: Add API for MV-related system tables
db/system_keyspace: Add virtual reader for MV in-progress build status
...
This is a separate file from view_schema_test because that one is
already becoming too long to run; also, having multiple test files
means they can be executed in parallel.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Intended for use by unit tests, this patch allows synchronizing with
the end of a build for a particular view.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the missing view building code to the eponymous class.
We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.
We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Builds a reader from a set of ordered mutations fragments. This is
useful for building a reader out of a subset of segments returned by a
different reader. It is equivalent to building a mutation out of the
set of mutation fragments, and calling
make_flat_mutation_reader_from_mutations, except that it doest not yet
support fast-forwarding.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The view_builder now uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.
We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the upcoming view building
code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Add a convenience base class for view notifications, which provides
a default implementation for all other types of notifications.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.
This patch introduces only the bootstrap functionality, which is
responsible for loading the data stored in the system tables and
filling the in-memory data structures with the relevant information,
to be used in subsequent patches for the actual view building. The
interaction with the system tables is as follows.
Interaction with the tables in system_keyspace:
- When we start building a view, we add an entry to the
scylla_views_builds_in_progress system table. If the node restarts
at this point, we'll consider these newly inserted views as having
made no progress, and we'll treat them as new views;
- When we finish a build step, we update the progress of the views
that we built during this step by writing the next token to the
scylla_views_builds_in_progress table. If the node restarts here,
we'll start building the views at the token in the next_token
column.
- When we finish building a view, we mark it as completed in the
built views system table, and remove it from the in-progress system
table. Under failure, the following can happen:
* When we fail to mark the view as built, we'll redo the last
step upon node reboot;
* When we fail to delete the in-progress record, upon reboot
we'll remove this record.
A view is marked as completed only when all shards have finished
their share of the work, that is, if a view is not built, then all
shards will still have an entry in the in-progress system table;
- A view that a shard finished building, but not all other shards,
remains in the in-progress system table, with first_token ==
next_token.
Interaction with the distributed system table (view_build_status):
- When we start building a view, we mark the view build as being
in-progress;
- When we finish building a view, we mark the view as being built.
Upon failure, we ensure that if the view is in the in-progress
system table, then it may not have been written to this table. We
don't load the built views from this table when starting. When
starting, the following happens:
* If the view is in the system.built_views table and not the
in-progress system table, then it will be in view_build_status;
* If the view is in the system.built_views table and not in
this one, it will still be in the in-progress system table -
we detect this and mark it as built in this table too,
keeping the invariant;
* If the view is in this table but not in system.built_views,
then it will also be in the in-progress system table - we
don't detect this and will redo the missing step, for
simplicity.
View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The populate_views() function takes a set of views to update, a
tokento select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.
In particular, this will be used by the view building code, which needs
to wait until the in-progress writes, which may have missed that there
is now a view, is observable to the view building code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.
A view can be absent from the result, successfully built, or
currently being built.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces a distributed system keyspace, used to hold
system tables that need to be replicated across a set of replicas
(that is, can't use the LocalStrategy).
In following patches, we will use this keyspace to hold a table
containing view building status updates for each node, used to support
range movements and a new nodetool command.
Fixes#3237
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements an API to access the MV-related system tables,
which pertain to the view building process.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Provide a virtual reader so users can query the in-progress view table
in a way compatible with Apache Cassandra.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When building a materialized view, we divide our work by shard, so we
need to register which shard did what work in the in-progress system
table. We also add the token we started at, which will enable some
optimizations in the view building code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Additional extension points.
* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
to inject extensions into hardcoded schemas.
Note: to make commitlog file extensions work, we need to both
enforce we can be notified on segment delete, and thus need to
fix the old issue of hard ::unlink call in segment destructor.
Segment delete is therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).
Configurable listeneres are now allowed to inject configuration
object into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.
All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"
* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
db::extensions: Allow extensions to modify (system) schemas
db::commitlog: Add commitlog/hints file io extension
db::commitlog: Do segment delete async + force replay delete go via CL
main/init: Change configurable callbacks and calls to allow adding opts
util::config_file: Add "add" config item overload
Allows extensions/config listeners to potentially augument
(system) schemas at boot time. This is only useful for schemas
who do not pass through system_schema tables.
Refs #2858
Push segement files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indescrimenately do deletes in non-anchored tasks, because we need
to guarantee that finshed segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.
Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
Since we just keep retrying, this can cause Scylla to not shutdown for
a while.
The data will be safe in the commit log.
Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.
Ref #3318.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(), which, if it fails, will leave the
sstable in the compaction_backlog_tracker::_ongoing_writes map, which
will waste disk space, and that sstable will map to a dangling pointer
to a destroyed database_sstable_write_monitor, which causes a seg
fault when accessed (for example, through the backlog_controller,
which accounts the _ongoing_writes when calculating the backlog).
Fix this by increasing the scope of handle_exception().
Fixes#3315
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
* Don't dump output of failed tests immediately, print the output
for failed tests in the end instead.
* Fix exception printing in run_test(): don't assume passed in error
object is a `bytes` (or bytes-like) object, call the object's str
operator instead and let callers encode bytes objects instead.
* Don't assume Exception object has an `out` member, use operator str
instead to convert it to string.
* Don't print progress in run_test() directly because it results in
incomprehensible output as the executors race to print to stdout. Leave
progress report to the caller who can serialize progress prints.
* Automatically detect non-tty stdout and don't try to edit already
printed text.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7bb7e0003ded9b28710250bff851ea849bb99f7d.1522062795.git.bdenes@scylladb.com>
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.
"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.
"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.
Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.
Tests: unit (release), dtest (PENDING)
"
* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
Roles are implemented
auth: Increase delay before background tasks start
auth: Remove ordering dependence
auth: Don't warn on rescheduled task
auth: Wait for schema agreement
Single-node clusters can agree on schema
I've observed failures due to "missing" the peer nodes by about 1
second. Adding 5 second to the existing delay should cover most false
negative test results.
Fixes#3320.
If `auth::password_authenticator` also creates `system_auth.roles` and
we fix the existence check for the default superuser in
`auth::standard_role_manager` to only search for the columns that it
owns (instead of the column itself), then both modules' initialization
are independent of one another.
Fixes#3319.
Apache Cassandra also prints at the `info` level. This change prevents
tasks which we expect to be rescheduled from failing tests and scaring
users.
A good example of this importance of this change is when queries with a
quorum consistency level (for the default superuser) fail because a
quorum is not available. We will try again in this case, and this should
not cause integration tests to fail.
Some modules of `auth` create a default superuser if it does not already
exist.
The existence check is through a SELECT query with quorum consistency
level. If the schema for the applicable tables has not yet propagated to
a peer node at the time that it processes this query, then the
`storage_proxy` will print an error message to the log and the query
will be retried.
Eventually, the schema will propagate and the default superuser will be
created. However, the error message in the log causes integration tests
to fail (and is somewhat annoying).
Now, prior to querying for existing data, we wait for all gossip peers
to have the same schema version as we do.
Fixes#2852.
At some points while bootstrapping [1], new non-seed Scylla nodes wait
for schema agreement among all known endpoints in the cluster.
The check for schema agreement was in
`service::migration_manager::is_ready_for_bootstrap`. This function
would return `true` if, at the time of its invocation, the node was
aware of at least one `UP` peer (not itself) and that all `UP` peers had
the same schema version as the node.
We wish to re-use this check in the `auth` sub-system to ensure that
the schema for internal system tables used for access-control have
propagated to the entire cluster.
Unlike in `service/storage_service.cc`, where `is_ready_for_bootstrap`
was only invoked for seed nodes, we wish to wait for schema agreement
for all nodes regardless of whether or not they are seeds.
For a single-node cluster with itself as a seed,
`is_ready_for_bootstrap` would always return `false`.
We therefore change the conditions for schema agreement. Schema
agreement is now reached when there are no known peers (so the endpoint
map of the gossiper consists only of ourselves), or when there is at
least one `UP` peer and all `UP` peers have the same schema version as
us.
This change should not impact any bootstrap behavior in
`storage_service` because seed nodes do not invoke the function and
non-seed nodes wait for peer visibility before checking for schema
agreement.
Since this function is no longer checking for schema agreement only in
the context of bootstrapping non-seed nodes, we rename it to reflect its
generality.
[1] http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
I see the following error:
seastar/core/future-util.hh:597:10: note: constraints not satisfied
seastar/core/future-util.hh:597:10: note: with ‘sstables::sstable_version_types* c’
seastar/core/future-util.hh:597:10: note: with ‘sub_partitions_read::run_test_case()::<lambda(sstables::sstable::version_types)> aa’
seastar/core/future-util.hh:597:10: note: the required expression ‘seastar::futurize_apply(aa, (* c.begin()))’ would be ill-formed
seastar/core/future-util.hh:597:10: note: ‘seastar::futurize_apply(aa, (* c.begin()))’ is not implicitly convertible to ‘seastar::future<>’
The C array all_sstable_versions decayed to a pointer (see second gcc note)
and of course doesn't support std::begin().
Fix by replacing the C array with an std::array<>, which supports std::begin().
Not clear what made this break again, or why it worked before.
Message-Id: <20180325095239.12407-1-avi@scylladb.com>
"
This patchset removes unneeded object files from the test link,
reducing unnecessary links and reducing link time and executable
size.
Tests: build (release)
"
* tag 'test-link/v1' of https://github.com/avikivity/scylla:
build: link release.o into scylla and perf_fast_forward binaries only
build: don't link api/ into tests
release.o depends on the release date and git hash, and therefore changes
every time ./configure.py is executed. In turn, this causes all tests to
relink.
Improve the situation by only linking release.o into binaries that require
it.
This helps continuous integration scripts, which call configure.py
unconditionally. Developers usually won't, so they will not see significant
savings.
Tests: build (release)
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.
Fixes#3304.
"
* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test reads with no clustering ranges and no static columns
tests: simple_schema: Allow creating schema with no static column
clustering_ranges_walker: Stop after static row in case no clustering ranges
When there are no clustering ranges, stop at position which is right
after the static row instead of position which is after all clustered
rows.
This fixes an abort in sstable reader when querying a partition with
no clustering ranges (happens with counter tables) which also doesn't
have any static columns. In such case, the sstable_mutation_reader
will setup the data_consume_context such that it only covers the
static row of the partition, knowing that there is no need to ready
any clustering row. See partition.cc::advance_to_upper_bound(). Later
when we're done with reading the static row (which is absent), we will
try to skip to the first clustering range, which in this case is
missing. If clustering_ranges_walker tells us to skip to
after_all_clustering_rows(), we will hit an asser inside
continuous_data_consumer::fast_forward_to() due to attempt to skip
past the original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine, becuase we end up
at the same data file position.
Fixes#3304.
"
There are a lot of things that we should be grouping in scheduling
groups that we aren't yet. The write path is not tagged at all,
mutation_query isn't either. Some, like streaming, are used - but not in
all places where they are needed.
Tests: unit (release)
"
* 'split-scheduling-groups-v2' of github.com:glommer/scylla:
database: group statements in their own scheduling group
database: apply streaming mutations with streaming priority
logalloc: capture current scheduling group for deferring function
Add a unit test for reproducing issue #2720 (and verifying its fix)
If a user tries to create a view whose primary key is missing any of the
base table's primary key columns, the creation should fail.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-3-nyh@scylladb.com>
Changed the order to check a couple of error conditions *after* checking
for too many or missing primary key columns. This order (showing the
too many or missing key columns first) is more useful, and is the order
in Cassadra's code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-2-nyh@scylladb.com>
A view's primary key must include all the columns of the base's primary
key. If we don't check this and fail the table's creation, we can discover
problems later on when using the table, as demonstrated in issue #2720.
We had such checking code (translated from the same code in Java) but it
had an extra "else" which caused nothing to be put in "missing_pk_columns"
so the error was never recognized.
Also, when the error does happen, we should print the column's name_as_text(),
not name() which is (surprisingly) just a number.
Fixes#2720.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-1-nyh@scylladb.com>
One of the tests created a base table with 5 primary key columns, but
put only 4 of them in the view. This is not allowed, but prior to fixing
issue #2720 this error was silently ignored. Let's fix the error instead
of relying on this silence.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180321094352.22329-1-nyh@scylladb.com>
When we introduced the CPU scheduler, we have also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.
To accomodate for that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are flushing the streaming memtables with streaming priority, but
applying the mutations themselves is still done with normal priorities.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we call run_when_memory_available, it is entirely possible that
the caller is doing that inside a scheduling_group. If we don't defer
we will execute correctly. But if we do defer, the current code will
execute - in the future - with the default scheduling group.
This patch fixes that by capturing the caller scheduling group and
making sure the function is executed later using it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
Since f8613a8415 we have reader-caching
on replicas for single-partition queries. This caching works best when
all pages of a query are sent to the same replicas consistently and thus
they can reuse the cached readers there.
The propability-based nature of read-repair works against this as on any
given page a read-repair will be attempted or not based on probability.
This will cause hight drop-rates on the replicas used for read-repair as
the cached reader will not be reusable if the replica was skipped for
one or more pages.
To fix this make the repair-decision once, on the first page of the
query and store the decision in the paging-state. On all remaining
pages of the query use this stored decision.
Tests: unit-tests(release, debug), dtest(paging_advanced_tests.py)
Refs: #1865
"
* 'per_query_repair_decision/v2' of https://github.com/denesb/scylla:
Make the read-repair decision only once
storage_proxy: add coordinator_query_options and coordinator_query_result
Add query_read_repair_decision to paging-state
For several reasons that I cannot fit in the margin, when a view is
created, at most ONE regular column from the base table may be added
to the view's key.
This small new test verifies that if we try to add two columns, the
view creation fails.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235453.1613-1-nyh@scylladb.com>
We had a unit test, test_primary_key_is_not_null, for testing that
we correctly complain - or don't complain - on missing "IS NOT NULL"
restrictions, as expected.
However, this test missed the actual bug we had regarding IS NOT NULL
checking - see issue #2628 - because it thought a silly syntax error
which caused an exception, was the exception we expected to see :-)
So in this patch, I rewrote this test. It fixes the test's bug and
demonstrates issue #2628 (and verifies its fix), and also tests a few
more corner cases.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235000.1399-1-nyh@scylladb.com>
When creating a materialized view, the user must provide a "IS NOT NULL"
restriction for each of the created view's primary columns. If such a
restriction is missing, the view creation should fail. In #2628 we noticed
that sometimes it wasn't failing, but later updates to such table would fail,
which is a bug.
There is actually one special case where "IS NOT NULL" is optional:
It is optional on the base's partition key column (when there is just
one of these) because it is already assumed that the partition key in
its entirety can never be.
Our "IS NOT NULL" test, validate_primary_key(), had two logic errors
which caused it to miss some cases of missing "IS NOT NULL":
1. Instead of checking whether a certain column is a the base's only
partition-key column, and avoid testing IS NOT NULL just for that
specific column, the code tested whether the schema *has* such a
column, and if it did, the test was skipped for all columns.
2. When the code found the one new column in the view's primary key, it
was so happy to find it that it immediately returned, and forgot to
test the IS NOT NULL on that column :-)
Both errors are fixed by this patch.
See the next patch for a unit test.
Fixes#2628.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319233657.522-1-nyh@scylladb.com>
Right now we have yield points between partition processing guaranteed
by the fact that there are .get()s around the code, and those include
an yield point.
We have been discussing removing the implicit yield point from get and
pushing that to the caller. In that spirit, let's yield explicitly here
if needed.
It should be the responsibility of the loop that it doesn't hurt
latency, either by the fact that it is bounded by a small number of
iterations or yields. In other words, that loop should have a yield
point on every iteration (like the non-thread variant does).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180319173051.8918-1-glauber@scylladb.com>
"
These patches add support for C* 2.2 file(name) format.
Namely:
* It forces Scylla to write files in la format.
* Adds storage-service feature for them.
* cf and ks are determined from directory, not from file-name (for 2.2 format).
* Adds some other fixes to make dtest happy.
* Unit tests work with la format or with both formats.
"
* 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla:
tests/sstables: Tests use la format or iterate over both formats.
tests/sstables: Helper functions support 2.2 format directory structure.
stables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits.
storage_service: Support la sstable storage format as a feature.
sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
sstables: Throw more detail exception for unknown item in reverse_map.
sstables/compaction: Suppress NaN in a report of a throughput.
Make the read-repair decision on the first page of a paged-query and use
it for all the remaining pages. This helps querier-cache hit-rates as
reads to nodes will be sent consistently throught the query.
"
Fixes to gossip pertaining to the shadow round.
In particular, an issue preventing a node from being marked as alive is
fixed: After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them marks
them as dead.
If the shadow round failed, after doing the feature checking against the
system tables, we were not clearing the state map and re-adding the
endpoints. This leaves the alive marker set, and prevents
real_mark_alive() from eventually being called.
Fixes#3301
"
* 'gossip/shadow-round-fixes/v3' of https://github.com/duarten/scylla:
gms/gossiper: Remove superfluous check
service/storage_service: Always re-add loaded endpoints
gms/gossiper: Check for shadow round completion before throwing
As yet more parameters and return-values are about to be added to all
storage_proxy::query_* methods we need a way that scales better than
changing the signatures every time. To this end we aggregate all
non-mandatory query parameters into `coordinator_query_options` and all
return values into `coordinator_query_result`.
This way new fields can be simply added to the respective structs while
the signatures of the methods themselves and their client code can
remain unchanged.
This new field will store the repair-decision made on the first page of
the query. This decision will be sticky to all pages of the query.
In mixed clusters the decision might not happen on the first page and it
might even change during the query as old coordinators will not store
nor respect the decision.
After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them
marks them as dead.
If the shadow round failed, after doing the feature checking against
the system tables, we were not clearing the state map and re-adding
the endpoints. This left the alive marker set, and prevented
real_mark_alive() from eventually being called.
Fixes#3301
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patchset reduces the time required to run the tests, mostly by
running them in parallel.
I measured a reduction of 3.5X on a 1s4c4t desktop (release mode).
Tests: unit (release)
* tag 'faster-tests/v2' of https://github.com/avikivity/scylla:
tests: run tests in parallel
tests: simplify timeout handling
tests: don't require crash integrity
tests: allow sharing the machine with other tests
tests: extract seastar options to a separate variable
tests: reduce memory for tests
tests: add "--" unconditionally for boost tests
tests: start cql_test_env without binding to messaging port
storage_service: allow starting gossiper without binding to messaging port
gms: allow gossiper to start_gossiping() without binding to the port
tests: close file correctly in loading_file_test
By using the overprovisioned flag, we reduce polling and pinning, so
less CPU time is wasted and the scheduler has more options to schedule
reactor threads.
Now that we have a minimum boost version, we don't need to check whether
boost requires "--" before test-specific command line arguments. Removing
the check speeds up the test a little.
This is useful in tests, which don't communicate. Binding to a port can
fail if the system is running something else.
It would be better to prevent even more of the gossiper from starting up,
but that is more difficult.
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.
This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).
Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.
Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
Summary has a function, memory_size(), that estimates the amount of
memory the summary takes. It is my understanding that this is called
to serve information to tooling.
First, this function is innacurate because it doesn't take into account
the tokens per each entry, just the keys. But more importantly, it has
to iterate over all keys which can be pretty expensive if the entries
list is long. We are now keeping that in a memory area, with just
pointers in the entry. So instead of iterating through the entries, we
can iterate through the memory areas, which is much cheaper.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180316120915.16809-1-glauber@scylladb.com>
When the cluster is changed (nodes added or removed), ranges of tokens
are moved between nodes. Scylla initiates a streaming process between an
old and a new owner of the range, which can take a long time. During
that streaming time, the new owner of the range is known as a "pending node"
for this range, and all updates must go to both the old owner (in case the
movement fails!) and the pending node (in case the movement succeeds).
For materialized views, because they are ordinary tables, streaming moves
all the view's data that existed before the streaming started. But we did
not send updates done to the view *during* the streaming. A dtest
demonstrates that the new node will miss some of the view update, and will
require a repair of the view tables immediately after the cluster change
ends, which is not good. To fix that, we need to send every new update
that happens during the streaming also to the "pending node". We already
did this properly for base-table updates, but not to the view updates:
Each base table replica wrote to only one paired view table replica,
and nobody wrote to the new pending node (in case where there is one,
for the particular view token involved).
In this patch, we make sure that all view updates go also to the "pending
nodes" when there are any. We do the same thing that Cassandra does, which
is - *all* base replicas write the update to the pending node(s).
Arguably, it is inefficient that all replicas send the update to the same
node. In most cases it is enough to send it from just one base replica -
the one who is slated to be the new node's pair. I opened
https://issues.apache.org/jira/browse/CASSANDRA-14262 about this idea.
But that is an optimization. The patch as-is already fixes the bug.
Fixes#3211
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180313171853.17283-1-nyh@scylladb.com>
The functional change in this series is in the last patch
("auth: Grant all permissions to object creator").
The first patch addresses `const` correctness in `auth`. This change
allowed the new code added in the last patch to be written with the
correct `const` specifiers, and also some code to be removed.
The second-to-last patch addresses error-handling in the authorizer for
unsupported operations and is a prerequisite for the last patch (since
we now always grant permissions for new database objects).
Tests: unit (release)
* 'jhk/default_permissions/v3' of https://github.com/hakuch/scylla:
auth: Grant all permissions to object creator
auth: Unify handling for unsupported errors
auth: Fix life-time issue with parameter
auth: Fix `const` correctness
"
This is an improvement on my latest series. Instead of just
dealing with the problem of destroying the Summary that I have
identified in a previous test, I have tried to find other sources
of stalls.
Some of them are on readers and would affect early processes and
operations like nodetool refresh.
Others are on writers, which can affect any SSTable being written.
Two of those stalls (on large filter, on summary read), I saw in a
synthetic benchmark where I used very small values + nodetool compact
to generate one SSTable with many keys. They were 80ms and 20ms
respectively, and now they are totally gone.
For others, I just tried to be safe (for instance, if we know
reading/writing large vectors can be costly, just always insert
preemption points in them).
With all of these patches applied, I no longer see stalls coming from
the SSTable code in those tests (although given enough time, I am sure I
can find more).
Tests: unit (release)
Fixes: #3282, Fixes#3281, Fixes#3269
"
* 'sstables-stalls-v3-updated' of github.com:glommer/scylla:
large_bitset/bloom filter: add preemption points in loops
sstables: read filter in a thread
abstract summary entry version of the token with a token view
add a token_view
sstables: rework summary entries reading
sstables: avoid calls to resize for vectors
sstables: replace potentially large for loop with do_until
summary_entry: do not store key bytes in each summary entry
tests: change tests to make summary non-copyable
chunked_vector: do not iterate to destruct trivially destructible types
SSTables that contain many keys - a common case with small partitions in
long lived nodes - can generate filters that are quite large.
I have seen stalls over 80ms when reading a filter that was the result
of a 6h write load of very small keys after nodetool compact (filter was
in the 100s of MB)
Similar care should be taken when creating the filter, as if the
estimated number of partitions is big, the resulting large_bitset can be
quite big as well.
If we treat the i_filter.hh and large_bitset.hh interfaces as truly
generic, then maybe we should have an in_thread version along with a
common version. But the bloom filter is the only user for both and even
if that changes in the future, it is still a good idea to run something
with a massive loop in a thread.
So for simplicity, I am just asserting that we are on a thread to avoid
surprises, and inserting preemption points in the loops.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Constructing filter objects can be quite expensive. We will insert some
yield points around, and that is made a lot easier if we are calling
things from a thread.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
dht::token doesn't have a trivial destructor, so destroying an array
full of those can be quite expensive. If we use the same trick as we
used for the summary - storing the token data in a stable memory
location - we can leave the entries with a trivial destructor and destroy
the chunks themselves. Those being larger, they will be more efficient
to delete.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Ideally we would like tokens to be trivially destructible, so that we
can easily dispose of giant vectors holding them. While that is hard to
do with our current infrastructure, we can introduce a token_view, which
holds a bytes_view elements instead of the real data - making it
trivially destructible.
The comparators are then changed to take a token_view, and an implicit
conversion function is provided from tokens so they get compared.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently Ubuntu 18.04 uses distribution provided g++ and boost, but it's easier
to maintain Scylla package to build with same version toolchain/libraries, so
switch to them.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
dist/debian: build only scylla, iotune
dist/debian: switch to boost-1.65
dist/debian: switch to gcc-7.3
Like we did for generic arrays, let's move away from resize() in trying
to read summary entries and move to a reserve/push pattern.
I have tested this patch reading a summary file that a couple of MB big.
Stalls up to 20ms were seen. After applying this patch, no such stalls
are present.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
resize is considered harmful, since it will attempt to allocate memory
and initialize each element of the vector. This can cause reactor stalls
that correlates to latency peaks.
A better idiom is reserve first - so we know we will have enough memory
and won't have to move contents - and push_back/emplace_back each
individual member.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are pushing ints here, so it shouldn't be that bad in practice.
But a potentially gigantic for loop is just asking for a stall since we won't
need_preempt() it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If we store a bytes_view instead of bytes, that has a trivial destructor
and then we don't need to destroy each element individually. To do that,
we allocate the data in a couple of large arrays which can be disposed of
easily and point to it.
We still can't destroy trivially because of the token.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the summary can be copied, but in real life there is no reason
for this to be a requirement. Tests want it, so we can destroy a summary,
load another, and compare the two. We can achieve this by allowing the first
summary to be moved, and then we can still have a reference to the second.
I am about to make a change that will make the summary not copyable as a
requirement, so we need to do this first.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We get following compile error on Debian/Ubuntu with boost-1.63:
/opt/scylladb/include/boost/intrusive/pointer_plus_bits.hpp:76:48: error: '*((void*)& __tmp +136)' is used uninitialized in this function [-Werror=uninitialized]
n = pointer(uintptr_t(p) | (uintptr_t(n) & Mask));
~~~~~~~~~~~~~~^~~~~~~
This is known issue (https://github.com/boostorg/intrusive/issues/29), fixed
on boost-1.65.
Switch to boost-1.65 to fix the issue.
Fixes#3090
* seastar bcfbe0c...a66cc34 (3):
> reactor: fix sleep mode
> cpu scheduler: don't penalize first group to run
> Simple shellscript to find out which logical CPU's shards are mapped to
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plan to be created to stream data,
each having very few data. This cause each stream plan has low transfer
bandwidth, so that the total time to complete the streaming increases.
It makes more sense to send a percentage of the total ranges per stream
plan than a fixed ranges.
Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges v.s. 10% ranges:
Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds
After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds
Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
This reverts commit f792c78c96.
With the "Use range_streamer everywhere" (7217b7ab36) series,
all the user of streaming now do streaming with relative small ranges
and can retry streaming at higher level.
Reduce the time-to-recover from 5 hours to 10 minutes per stream session.
Even if the 10 minutes idle detection might cause higher false positive,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with
whenever the stream initiator goes away, the stream slave goes away.
Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
When a table, keyspace, or role is created, the creator now is
automatically granted all applicable permissions on the object.
This behavior is consistent with Apache Cassandra.
Fixes#3216.
Instead of some functions in `allow_all_authorizer` throwing exceptions
and others being silently pass-through, we consistently return exception
futures with `auth::unsupported_authorization_operation`. These errors
are converted to `invalid_request_exception` in the CQL error and
ignored where appropriate in the auth subsystem.
This patch came about because of an important (and obvious, in
hindsight) realization: instances of the authorizer, role manager, and
authenticator are clients for access-control state and not the state
itself. This is reflected directly in Scylla: `auth::service` is
sharded across cores and this is possible because each instance queries
and modifies the same global state.
To give more examples, the value of an instance of `std::vector<int>` is
the structure of the container and its contents. The value of `int
file_descriptor` is an identifier for state maintained elsewhere.
Having watched an excellent talk by Herb Sutter [1] and having read an
informative blog post [2], it's clear that a member function marked
`const` communicates that the observable state of the instance is not
modified.
Thus, the member functions of the role-manager, authenticator, and
authorizer clients should not be marked `const` only if the state of the
client itself is observably changed. By this principle, member functions
which do not change the state of the client, but which mutate the global
state the client is associated with (for example, by creating a role)
are marked `const`.
The `start` (and `stop`) functions of the client have the dual role of
initializing (finalizing) both the local client state and the
external state; they are not marked `const`.
[1] https://herbsutter.com/2013/01/01/video-you-dont-know-const-and-mutable/
[2] http://talesofcpp.fusionfenix.com/post-2/episode-one-to-be-or-not-to-be-const
"
Terms
-----
querier: A class encapsulating all the logic and state needed to fill a
page. This Includes the reader, the compact_mutation object and all
associated state.
Preamble
--------
Currently for paged-queries we throw away all readers, compactors and
all associated state that contributed to filling the page and on the
next page we create them from scratch again. Thus on each page we throw
away a considerable amount of work, only to redo it again on the next
page. This has been one of the major contributors to latencies as from
the point of view of a replica each page is as much work as a fresh
query.
Solution
--------
The solution presented in this patch-series is to save queriers after
filling a page and reuse them on the next pages, thus doing the
considerable amount of work involved with creating the them only once.
On each page the coordinator will generate a UUID that identifies this
page. This UUID is used as the key, under which the contributing
queriers will be saved in the cache. On the next page the UUID from the
previous page will be used to lookup saved queriers, and the one from
the current one to saved them afterwards (if the query isn't finished).
These UUIDs (reader_recall_uuid and reader_save_uuid) are attached to
the page-state. Also attached to the page state is the list of replicas
hit on the last page. On the next page this list will be consulted to
hit the same replicas again, thus reusing the queriers saved on them.
Cached queriers will be evicted after a certain period of time to avoid
unecessary resource consumption by abandoned reads.
Cached queriers may also be evicted when the shard faces
resource-pressure, to free up resources.
Splitting up the work
---------------------
This series only fixes the singular-mutation query path, that is queries
that either fetch a single partition, or severeal single partitions (IN
queries). The fix for the scanning query path will be done in a
follow-up series, however much of the infrastructure needed for the
general querier reuse is already introduced by this series.
Ref #1865
Tests: unit-tests(debug, release), dtests(paging_test, paging_additional_test)
Benchmarking summary (read-from-disk)
-------------------------------------
1) Latency
BEFORE
latency mean : 58.0
latency median : 57.4
latency 95th percentile : 68.8
latency 99th percentile : 79.9
latency 99.9th percentile : 93.6
latency max : 93.6
AFTER
latency mean : 41.3
latency median : 40.5
latency 95th percentile : 50.8
latency 99th percentile : 68.9
latency 99.9th percentile : 89.2
latency max : 89.2
2) Throughput (single partition query)
sum(scylla_cql_reads):
BEFORE: 173'567
AFTER: 427'774
+246%
3) Throughput (IN query, 2 partitions)
sum(scylla_cql_reads):
BEFORE: 85'637
AFTER: 127'431
+148%
"
* '1865/singular-mutations/v8.2' of https://github.com/denesb/scylla: (23 commits)
Add unit test for resource based cache eviction
Add unit tests for querier_cache
Add counters to monitor querier-cache efficiency
Memory based cache eviction
Add buffer_size() to flat_mutation_reader
Resource-based cache eviction
Time-based cache eviction
Save and restore queriers in mutation_query() and data_query()
Add the querier_cache_context helper
Add querier_cache
Add querier
Add are_limits_reached() compact_mutation_state
Add start_new_page() to compact_mutation_state
Save last key of the page and method to query it
Make compact_mutation reusable
Add the CompactedFragmentsConsumer
Use the last_replicas stored in the page_state
query_singular(): return the used replicas
Consider preferred replicas when choosing endpoints for query_singular()
Add preferred and last replicas to the signature of query()
...
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of it's own.
"
This patchset is a part of a bigger effort for bringing our
microbenchmarking tests from the source tree to be used for regression
testing purposes with CI.
Now, it is possible to export results of tests run into JSON format that
can be stored in ElasticSearch and compared among runs to detect
performance degradation should it happen.
Example of JSON output (formatted for readability):
{
"results" :
{
"parameters" :
{
"read" : "64",
"read,skip,test_run_count" : "64,256,1",
"skip" : "256",
"test_run_count" : 1
},
"stats" :
{
"(KiB)" : 126960,
"aio" : 993,
"blocked" : 208,
"c blk" : 1,
"c hit" : 0,
"c miss" : 1,
"cpu" : 99.779365539550781,
"dropped" : 0,
"frag/s" : 311939.61559016741,
"frags" : 200000,
"idx blk" : 0,
"idx hit" : 0,
"idx miss" : 0,
"time (s)" : 0.641149729
}
},
"test_group_properties" :
{
"message" : "Testing scanning large partition with skips.\nReads whole range interleaving reads with skips according to read-skip pattern",
"name" : "large-partition-skips",
"needs_cache" : false,
"partition_type" : "large"
},
"versions" :
{
"scylla-server" :
{
"commit_id" : "4acfa17f4",
"date" : "20180306",
"run_date_time" : "2018-16-06 12:16:41",
"version" : "666.development"
}
}
}
"
* 'issues/2947/v6' of https://github.com/argenet/scylla:
Add support for JSON output format for perf_fast_forward results.
Wrap output for customization. Move all output handling to a single managing class.
Add the following counters:
(1) querier_cache_lookups
(2) querier_cache_misses
(3) querier_cache_drops
(4) querier_cache_time_based_evictions
(5) querier_cache_resource_based_evictions
(6) querier_cache_memory_based_evictions
(6) querier_cache_population
(1) counts the total number of querier cache lookups. Not all
page-fetches will result in a querier lookup. For example the first page
of a query will not do a lookup as there was no previous page to reuse
the querier from. The second, and all subsequent pages however should
attempt to reuse the querier from the previous page.
(2) counts the subset of (1) where the read have missed the querier
cache (failed to find a matching saved querier).
(3) counts the subset of (1) where the querier was recalled and dropped
immediately. This can happen for example if the querier was at the wrong
position.
(4) counts the cached queriers that were evicted due to their TTL
expiring.
(5) counts the cached queriers that were evicted due to reader-resource
(those limited by reader-concurrency limits) shortage.
(6) counts the cached queriers that were evicted due to reaching the
cache's memory limits (currently set to 4% of the shards' memory).
(7) is the current number of entries in the cache
Note:
* The count of cache hits can be derived from these counters as
(1) - (2).
* cache_drop (3) also implies a cache hit (see above). This means that
the number of actually reused queriers is:
(1) - (2) - (3)
To bound the memory consumption of the querier-cache the total memory
consumption of the cached queriers is limited to 4% of the shard's total
memory.
When inserting a new querier it is first checked whether it's insertion
would cause the limit to be crossed. If this is the case existing
entries are evicted until the memory consumption is sufficiently reduced
so that after inserting the querier it stays below the limit.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
To calculate the memory consumption of the cached queriers
flat_mutation_reader::buffer_size() is used. While this is not very
precise as it doesn't include object sizes and member containers it
gives a good picture of the memory consumption of the queriers.
Memory based cache eviction overlaps with resource-based cache eviction
but only to some degree as that only accounts the memory consumption of
sstable readers.
buffer_size() exposes the collective size of the external memory
consumed by the mutattion-fragments in the flat reader's buffer. This
provides a basis to build basic memory accounting on. Altought this is
not the entire memory consumption of any given reader it is the most
volatile component and usually by far the largest one too.
Readers serving user-reads need to obtain a permit to start reading.
There exists a restriction on how much active readers can be admitted
based on their count and their memory onsumption.
Since the saved readers of cached queriers are techically active (they
hold a permit) they can block new readers from obtaining a permit.
New readers have a higher priority because a cached reader might be
abandoned or used later at best so in the face of memory pressure we
evict cached readers to free up permits for new readers.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
Cached queriers should not sit in the cache indefinitely otherwise
abandoned reads would cause excess and unncessary resource-usage. Attach
an expiry timer to each cache-entry which evicts it after the TTL
passes.
Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
querier_cache_context is supposed to make propagating the cache and the
key down the layers. It comes bundled with some of the required
parameters (the lookup and save state) and aso hides all of the
boiler-plate of dealing with the cache (checking whether the key is
non-empty, etc.). It also makes it possible to not use the cache and
hide this from the lower layers.
This is the cache where suspended queriers are going to be saved between
pages. This is not a general purpose cache. It caters to the specific
needs of the querier recall mechanism. More specifically:
(1) Cache entries are of single-use, they are inserted once and the first
lookup removes them. Multiple items may be stored under a single key.
Identifying the correct one happens based on additional information like
the query range. Lookup knows to drop queriers when they cannot be used
to serve the next page.
(2) Cache entries are evicted after a certain time to avoid the
depletion of resources due to abandoned reads.
(3) Cache entries are evicted when facing reader-permit shortage, until
either enough permits are freed up or all entries are evicted.
(4) A memory limiter is set up which keeps the total memory consumption
of the cache under a limit (4% of memory) by evicting the oldest entries
when inserting a new one would cause the total memory consumption to go
above the limit.
(5) It updates the relevant counters of the db_stats.
This patch only implements (1), the other features will be implemented
in their own patches.
The querier encapsulates all objects needed to serve queries, except
result builders. It is designed to be suspendable, savable and
resumable. It contains all logic needed to suspend, resume and determine
whether the querier can be resumed or not.
It is the foundation upon which the "reader-reuse" mechanism is built.
are_limits_reached() allows querying whether the compactor reached
the page's limits. This is needed to determine whether there will be
more pages and thus whether the compact_mutation_state has to be kept
around.
start_new_page() resets the limits to the current page's ones and
sets the _empty_partition flag so that the partition header (if the last
page finished inside a partition) will be reemitted.
Make a copy of the current decorated-key in consume_end_of_stream() so
that it persists while the compaction state is suspended.
Also add current_partition() to allow client code to query the partition
the compaction is positioned in. This is needed to determine whether
the start position of the next page matches that of the
compact_mutation_state.
Currently compact_mutation is used as a use-once-then-throw-away object.
After it satisfies its consumer it's destroyed together with the
consumer. This conflicts with the effort to save and reuse readers and
associated infrastructure between pages of a query.
To resolve this conflict compact_mutation is split into two classes:
(1) compact_mutation_state
(2) compact_mutation
compact_mutation_state encapsulates all the compaction logic and state,
while compact_mutation continues to provide the same API using
compact_mutation_state behind the scenes.
compact_mutation_state doesn't store the consumer, instead its
consume_* methods are templated on the consumer and take it as an
argument. This allows compact_mutation_state to be independent of the
consumer's type.
Additionally compact_mutation can now be constructed from a shared
pointer to compact_mutation_state. This allows client code to
pre-construct a compaction state and retain it after the
compact_mutation object is destroyed.
These changes allow the state of a compaction to be saved and restored
later while code that is only interested in storing the saved state
can stay independent of the consumer's type.
This patch only contains the splitting of compact_mutation into
compact_mutation and compact_mutation_state. The next patches will add
the missing functionality that is needed to make compact_mutation_state
truly reusable across pages.
Pass the last_replicas from the page_state as the preferred_replicas
for query() and save the returned last_replicas as the last_replicas
field of the next page_state. The circle is now complete. The first page
of any query will pass an empty list as the preferred replicas (having
no previous paging_state) so the replicas will be selected according to
the load-balancing strategy. Any subsequent page will use the last
replicas from the last page as the preferred ones for the current one.
Thus if all goes well all pages of a query will hit the same replicas.
This patch implements the last_replicas returning part of the query()
signature changes for singular queries. It allows for client code to
save the last returned replicas and pass it to query() on the next page
as the preferred-replicas parameter, thus faciliate the read requests
for the next page hitting the same replicas.
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algoritm for selecting the
endpoints is as follows:
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
preferred_replicas are added to the parameters and last_replicas are
added to the return type. The preferred replicas will be used as a hint
for the selection of the replicas to send the read requests to. The last
replicas (returned) are the replicas actually selected for the read.
This will allow queries to consistently hit the same replicas for each
page thus reusing readers created on these replicas.
For convenience a query() overload is provided that doesn't take or
return the preferred and last replicas.
This patch only adds the parameters and propagates them down to
query_singular() and query_partition_key_range(). The code to actually
use these preferred-replicas will be added in later patches.
This reason for separating this is to reduce noise and improve
reviewability for those functional changes later.
Helps paged queries consistently hit the same replicas for each
subsequent page. Replicas that already served a page will keep the
readers used for filling it around in a cache. Subsequent page request
hitting the same replicas can reuse these readers to fill the pages
avoiding the work of creating these readers from scratch on every page.
In a mixed cluster older coordinators will ignore this value.
The value of last_replicas may change between pages as nodes may become
available/unavailable or the coordinator may decide to send the read
requests to different replicas at its discretion.
Replicas are identified by an opaque uuid which should only make sense
to the storage-proxy.
This patch adds the parameter to read_command which is needed for
caching of readers during multiple pages of a paged queries, which
we will introduce in the next patches.
The query_uuid is a UUID of a previously saved reader, which
the replica is now asked to recall and resume (if this saved reader is
no longer in the cache, it is fine, a new reader will be started).
Additionally a helper flag is_first_page is added so that the replica
can avoid doing any cache lookups (and incrementing miss counters) for
the first page.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to the "paging_state", the opaque cookie that clients are
supposed to provide when asking for the next page on a paged query, a
unique id field. This new field will be used to tell that a new request
for a page really continues the previous page, and doesn't just by chance
start at the same position the previous page stopped.
We need to support setups with mixed versions - a client may get a paging
state from a coordinator running a new version of Scylla and send it to
a different coordinator running an old version - or vice versa. So the new
uuid field is set up to have a default uuid of UUID() (a recognizable
invalid uuid 0), so new versions receiving no uuid from an old version will
set this invalid uuid, and old versions receiving a uuid from a new version
will simply ignore it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Refactor some includes to reduce dependencies on messaging_service.hh,
which can change quite a lot as it includes many unrelated items itself.
Tests: build
* tag 'includes/messaging_service.hh/v1' of https://github.com/avikivity/scylla:
tests: reduce dependencies in test_services.hh
migration_manager: remove dependency on messaging_service.hh in header
messaging_service: move msg_addr into its own header file
Convert storage_service_for_test to a pimpl implementation to
reduce dependencies. Tests that depended on those includes were
fixed to include their dependencies directly.
Try to avoid recompilations by reducing inclusions of system_keyspace.hh
in other header files.
Tests: unit (release)
* tag 'system_keyspace.hh/v1' of https://github.com/avikivity/scylla:
storage_service: remove system_keyspace.hh include
locator: de-inline reconnectable_snitch_helper
locator: de-inline production_snitch_base
cql3: remove #include of system_keyspace.hh
De-inlining allows us to remove some dependencies, and those functions
are too complex to inline anyway.
A few always-throwing functions get the [[noreturn]] attribute to
avoid damaging code generation.
* seastar 08e02dc...42159d4 (9):
> memory: avoid unconditional calls to __tls_init
> io_tester: bring back information about think time
> Merge "Avoid continuations in I/O Scheduler path" from Glauber
> Merge "Extend io_tester to support CPU loads" from Glauber
> tutorial: fix undue complication in semaphore get_units() example
> Tutorial: in HTML target, inline code snippets shouldn't be gray
> tutorial: add build target for split HTML file
> tutorial: mention seastar::thread as option for object lifetime management
> tutorial: document new seastar::future::wait()
* dist/ami/files/scylla-ami 3aa87a7...5170011 (3):
> scylla_install_ami: install enhanced networking NIC drivers
> scylla_install_ami: set kernel-ml as default kernel
> scylla_install_ami: fix NIC down with enhanced networking on new base AMI
already called in update_info_for_opened_data() which is called by
open_data(); no need for clustering components to be set early
either.
found it when auditing the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180310225213.26017-1-raphaelsc@scylladb.com>
unsigned type was incorrectly used for keeping track of min and max
timestamp, so a negative number would be treated as a very high
number that would *incorrectly* end up as max timestamp in sstable
metadata.
Fixes#3000.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180308162217.18963-1-raphaelsc@scylladb.com>
"
Refs #2692Fixes#3246
The current restricting algorithm [1] restricts the active-reader queue
based on the memory consumption of the existing active readers. When
this memory consumption is above the limit new readers are not admitted.
The inactive reader queue on the other hand has a fixed length.
This caused performance regressions on two workloads:
* read-only: since the inactive-reader queue length is severly limited
(compared to the previous situation) reads will timeout at loads
comfortably handled before.
* mixed: since the memory consumption happens only at admission time
(already created active readers are not limited) memory consumption
growed significantly causing problems when compactions kicked in.
The solution is to reintroduce the old limit of 100 active concurrent
user-reads while still keeping the memory-based limit as well. For
workloads that don't consume a lot of memory or on large boxes with lots
of memory the count-based limit will be reached which is reverting to the
old well-known behaviour. For memory-hungry workloads or on small boxes
with little memory the memory based-limit will kick in sooner avoiding
memory overconsumption.
[1] introduced by bdbbfe9390
"
* 'restricted-reader-dual-limit/v3' of https://github.com/denesb/scylla:
Modify unit tests so that they test the dual-limits
Use the reader_concurrency_semaphore to limit reader concurrency
Add reader_concurrency_semaphore
Add reader_resource_tracker param to mutation_source
mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
This semaphore implements the new dual, count and memory based active
reader limiting. As purely memory-based limiting proved to cause
problems on big boxes admitting a large number of readers (more than any
disk could handle) the previous count-based limit is reintroduced in
addition to the existing memory-based limit.
When creating new readers first the count-based limit is checked. If
that clears the memory limit is checked before admitting the reader.
reader_conccurency_semaphore wraps the two semaphores that implement
these limits and enforces the correct order of limit checking.
This class also completely replaces the restricted_reader_config struct,
it encapsulates all data and related functinality of the latter, making
client code simpler.
Soon, reader_resource_tracker will only be constructible after the
reader has been admitted. This means that the resource tracker cannot be
preconstructed and just captured by the lambda stored in the mutation
source and instead has to be passed in along the other parameters.
In preparation to reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
"
This series switches granularity of memory-pressure-induced eviction in cache
from a partition to a row.
Since 9b21a9b cache can store partial partitions with row granularity but they
were still evicted as a unit. This is problematic for the following reasons:
- more is evicted than necessary, which decreases cache efficiency. In the
worst case, whole cache gets evicted at once
- evicting large amounts of memory (large partitions) at once may impact
latency badly
Fixes#2576.
See the documentation added in patch titled "doc: Document row cache eviction"
for details on how eviction works.
Open issues to be fixed incrementally:
- range tombstones are not evictable
- cache update still has partition granularity, which
causes bad latency on memtable flush with large partitions
"
* tag 'tgrabiec/row-level-eviction-v3' of github.com:scylladb/seastar-dev: (43 commits)
doc: Document row cache eviction
tests: cache: Add tests for row-level eviction
tests: cache: Check that data is evictable after schema change
tests: cache: Move definitions to the top
tests: perf_cache_eviction: Switch eviction counter to row granularity
tests: row_cache_alloc_stress: Avoid quadratic behavior
cache: Introduce unlink_from_lru()
cache: Add row-level stats about cache update from memtable
mvcc: Propagate information if insertion happened from ensure_entry_if_complete()
cache: Track number of rows and row invalidations
cache: Evict with row granularity
cache: Track static row insertions separately from regular rows
tests: mvcc: Use apply_to_incomplete() to create versions
tests: mvcc: Fix test_apply_to_incomplete()
tests: cache: Do not depend on particular granularity of eviction
tests: cache: Make sure readers touch rows in test_eviction()
mvcc: Store complete rows in each version in evictable entries
mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest()
tests: cache: Invoke partial eviction in test_concurrent_reads_and_eviction
cache: Ensure all evictable partition_versions have a dummy after all rows
...
Partitions corresponding to keys have 40k rows. With row-level
eviction touching them inside the loop became a serious performance
issue, because touch() now needs to walk over all rows.
Will be used in row_cache_alloc_stress to unlink partitions which we
don't want to get evicted, instead of reapeatedly calling touch() on
them after each subsequent population. After switching to row-level
LRU, doing so greatly increases run time of the test due to quadratic
behavior.
The address and keyspace should be swapped.
Before:
range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded,
took 56 seconds
After:
range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded,
took 56 seconds
Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>
The JSON output is arranged in a way that makes it easier to upload
results to ElasticSearch.
All the tests results are placed under the perf_forward_data_output/ directory
For test groups, we create separate subdirectories where we save results
from runs of tests in those groups.
For each test run, we store results in a separate file named:
<dash-separated-param-list>.<run-number>.json
where
<dash-separated-param-list> is a dash-separated list of parameters of the current
test, e.g., 1-64 (for read-skip pattern).
<run-number> is the number of run of this test with the specified
parameters. This is needed as the same list of parameters can be
used more than once (for instance, when cache is enabled).
Those numbers start with 1, i.e., 1, 2, 3.
So, the path to a resulting JSON file may look like:
perf_fast_forward_output/large-partition-skips/64-4096.1.json
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Instead of passing the output parameters to std::cout straight away, use
helper wrappers. This will allow us to add more formats for gathered
tests results.
Introduce helper writer classes hierarchy that can be extended to
support different output formats (JSON, XML, etc).
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Instead of evicting whole partitions, evicts whole rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
For row-level eviction we need to ensure that each version has
complete rows so that eviction from older versions doesn't affect the
value of the row in newer snapshots.
This is achieved by copying the row from an older version before
applying the increment in the new version.
Only affects evictable entries, memtables are not affected.
Every evictable version will have a dummy entry at the end so that it can be
tracked in the LRU.
It is also needed to allow old versions to stay around (with
tombstones and static rows) after all rows are evicted. Such versions
must be fully discontinuous, and we need some entry to mark that.
We will need to propagate a cache_tracker reference to evict(). Instead
of evicting from destructor, do so before cache_entry gets unlinked
from the tree. Entries which are not linked, don't need to be
explicitly evicted.
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.
Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:
v2: <key=2, cont=1>
v1: <key=1, cont=1>
In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:
v2: <key=2, cont=1>
Here, the range (-inf, 1) would be marked as continuous, which is not true.
To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.
It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier. I make use of the above
assumption in mutation_partition::apply_monotonically().
MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.
The example from the beginning will become:
v2: <key=1, cont=0> <key=2, cont=1>
v1: <key=1, cont=1>
When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
Simply copying mutations which are not fully continuous may violate
MVCC invariants, like the one about non-overlapping continuity which
will be added later. Use apply_to_incomplete() instead.
This unfortunately reduces strenght of the test, since the continuity
of the entry is now completely determined by the first version. We should
use populate() instead, but it doesn't exist yet. It could be extracted
from cache_streamed_mutation, but that's not an easy change.
This is alleviated by adding a similar test to row_cache_test_g, in a
later patch.
In commit 8af0b501a2 (gossip: wait for stabilized gossip on bootstrap)
The force_after variable was changed from int32_t to stdx::optional<int32_t>
- if (force_after > 0 && total_polls > force_after) {
+ if (force_after && total_polls > *force_after) {
Checking force_after > 0 was dropped which is wrong because force_after
is set to -1 by default. So the if branch will always be executed after
1 poll.
We always see:
[shard 0] gossip - Gossip not settled but startup forced by
skip_wait_for_gossip_to_settle. Gossp total polls: 1
even if skip_wait_for_gossip_to_settle is not set at all.
Fixes#3257
Message-Id: <845d219cea6101a7c507c13879c850a5c882e510.1520297548.git.asias@scylladb.com>
In Crypto++ v6, the `byte` typedef has been moved from the global
namespace to the CryptoPP:: namespace.
To make Scylla code compile with both old and new versions, bring the
namespace in so that the code works regardless of the scope of `byte`
definition.
Fixes#3252
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <60e7bfe868b778b1c9bbe15d7247db64b61bd406.1520272198.git.vladimir@scylladb.com>
"-I$full_builddir/{mode}/xxhash" doesn't resolve to a valid path, because
full_builddir is a Python variable, not a Ninja variable. In build.ninja
it appears as "-I/release/xxhash".
Since the build nevertheless works, we can remove the broken flag instead
of fixing it.
Message-Id: <20180305135919.13634-1-avi@scylladb.com>
This patch fixes an issue with test_propagation(), where the test
assumed that after the future returned from wait_for_pending(0)
resolved, the continuations set for the post operation had already
run, which is not true.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180305131908.7667-1-duarte@scylladb.com>
reader_wrapper's _timeout defaults to now(), which means to time
out immediately rather than no timeout.
Fix by switching to a time_point, defaulting to no_timeout, and
provide a compatible constructor (with a duration parameter) for
callers that do want a duration-based timeout.
Tests: mutation_reader_test (debug, release)
Message-Id: <20180305111739.31972-1-avi@scylladb.com>
* seastar f841d2d...08e02dc (3):
> future: make future::wait() a supported function
> scripts: perftune.py: don't allow cpu-mask that does't include any IRQ CPU
> Tutorial: show nice dashes in HTML
The message in question is printed with printf() which is bad by itself.
And most importantly this test uses a single .property file so this message
doesn't add any interesting information to begin with. Therefore it makes
more sense to drop it than to fix it.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1519661059-13325-1-git-send-email-vladz@scylladb.com>
This patch series ties up some loose ends around CQL syntax for access-control statements.
The USER-based syntax statements are all backwards compatible. ROLE-specific statements have a new syntax which is described in "cql: Make role syntax for consistent". Other statements (like GRANT) have been updated to accept role names (instead of the more restrictive `username` rule).
Fixes#3217.
Tests: unit (debug)
* 'jhk/roles_syntax/v2' of https://github.com/hakuch/scylla:
tests: Rename test for consistency
cql: Eliminate uses of legacy `username` rule
cql: Elaborate error for quoted user names
cql: Allow role names to be string literals
cql: Make role syntax more consistent
tests: Add CQL syntax tests for access-control
Since quoted names are allowed for role names, we add a more descriptive
error message when a quoted name is (erroneously) used for a user name.
This behavior is consistent with Apache Cassandra.
This patch changes the syntax for CQL statements related to roles to
favor a form like
CREATE ROLE sam WITH PASSWORD = 'shire' AND LOGIN = false;
instead of
CREATE ROLE sam WITH PASSWORD 'shire' NOLOGIN;
This new syntax has the benefit of not imposing any ordering constraints
on the modifiers for roles and being consistent with other parts of the
CQL grammar. It is also consistent with syntax in Apache Cassandra.
The old USER-based statements (CREATE USER and ALTER USER) still have
the old forms for backwards compatibility.
A previous change modified the USER-related statements to allow for the
OPTIONS option. However, this was a mistake; only the PASSWORD option
should have been allowed. This patch also corrects this mistake.
These are quick-running tests for verifying the accepted forms of CQL
statements (and fragments) related to access-control: users, roles, and
permissions.
Establishing the allowed forms of statements is helpful for reference,
but also makes syntax changes (like those expected in later patches)
clearer and more safe.
Since we splited scylla-housekeeping service to two different services for systemd, we don't share same service name between systemd and upstart anymore.
So handle it independently for each distribution, try to install
/etc/init/scylla-housekeeping.conf on Ubuntu 14.04.
Fixes#3239
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519852659-10688-1-git-send-email-syuu@scylladb.com>
Support POWER architecture on Scylla.
Since DPDK is not fully supported on POWER (no PMD supported on it yet),
disabled it for now.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180228203048.21593-1-syuu@scylladb.com>
swap_tree() doesn't change the color of the header, and becasue header
was not initialized, it is undefined (can be both red or black). One
problem this causes is that algo::is_header() expects the header to be
always red. It is used by unlink(), which for trees which have a black
header would infinite-loop.
The fix is to initialize the header.
Fixes#3242.
Message-Id: <1519815091-13111-1-git-send-email-tgrabiec@scylladb.com>
Debian and ubuntu list files come in two variations.
The housekeeping should support both.
This patch change the regexp that match the os in the repository file.
After the introduction of the second list variation, the os name can be in the middle of the path not only at the end.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180227092543.19538-1-amnon@scylladb.com>
Currently serializing and deserializing singular ranges is asymetric.
When serializing a range we use the start() and end() functions to
obtain _start and _end respectively. However for singular ranges end()
will return _start and therefore the serialized range will have two
engaged optionals for bounds whereas the in-memory version will have only
one. The immediate consequence of this is that after serializing and
deserializing a range it will not compare equal to the original
serialized range. Needless to say this is *very* suprising behaviour.
To remedy the issue we fix the wrapping_range's constructor to not set
_end to the passed in value when the range is singular.
This way the on-wire format can stay compatible to how the range is
percieved by client code (when is_singular(): start() == end()) but
constructing the range from the wire-format will yield a range that will
always compare equal to the original one.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <e5f20b7b45f65ca1f7b347dcccd2ac462869e7ff.1519652739.git.bdenes@scylladb.com>
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)
Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object on boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).
Table/view tables already contain an extension element, so
we utilize this to persist config.
schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes.
Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instansiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.
Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"
* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
extensions: Small unit test
sstables: Process extensions on file open
sstables::types: Add optional extensions attribute to scylla metadata
sstables::disk_types: Add hash and comparator(sstring) to disk_string
schema_tables: Load/save extensions table
cql: Add schema extensions processing to properties
schema_tables: Require context object in schema load path
schema_tables: Add opaque context object
config_file_impl: Remove ostream operators
main/init: Formalize configurables + add extensions to init call
db::config: Add extensions as a config sub-object
db::extensions: Configuration object to store various extensions
cql3::statements::property_definitions: Use std::variant instead of any
sstables: Add extension type for wrapping file io
schema: Add opaque type to represent extensions
sstables::compress/compress: Make compression a virtual object
LSA being an allocator built on top of the standard may hide some
erroneous usage from AddressSanitizer. Moreover, it has its own classes
of bugs that could be caused by incorrect user behaviour (e.g. migrator
returning wrong object size).
This patch adds basic sanitizer for the LSA that is active in the debug
mode and verifies if the allocator is used correctly and if a problem is
found prints information about the affected object that it has collected
earlier. Theat includes the address and size of an object as well as
backtrace of the allocation site. At the moment the following errors are
being checked for:
* leaks, objects not freed at region destructor
* attempts to free objects at invalid address
* mismatch between object size at allocation and free
* mismatch between object size at allocation and as reported by the
migrator
* internal LSA error: attempt to allocate object at already used
address
* internal LSA error: attempt to merge regions containing allocated
objects at conflicting addresses
Message-Id: <20180226122314.32049-1-pdziepak@scylladb.com>
Make sure idx will not be equal to _control_points.size() (and thus
overflow the vector) when looking for the first control-point with
a backlog not smaller then the current one, by stopping when it's equal
to _control_points.size() - 1.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <47841592792573d820650d570fa1ab7e58bdac2c.1518700405.git.bdenes@scylladb.com>
* seastar 383ccd6...f841d2d (8):
> Merge "Randomize task queue in debug mode" from Duarte
> tutorial: document seastar::thread
> tutorial: add missing seastar namespace
> tutorial: note about asynchronous functions throwing exceptions
> thread: stop backtraces on aarch64 from underflowing the stack
> Revert "core:🧵 ARM64 version of annotating the frame"
> core:🧵 ARM64 version of annotating the frame
> core/future-util: Release exception in repeater
Release mode flags are properly propagated through seastar --optflags
flag, but debug mode flags aren't. This is problematic since they are
used to enable additional debugging features.
After this patch we will end up with some duplicate flags, but that's
not really a problem.
Message-Id: <20180223173617.15199-1-pdziepak@scylladb.com>
Before 312bd9ce25, boot had to call all shards for each sstable
such that they would agree/disagree on their deletion, an atomic
deletion manager requirement.
After its removal, we can afford to call only the shards that own
a given sstable.
Reducing the operation on each sstable from (SSTABLES) * (SHARD_COUNT)
to usually (SSTABLES). It may be the same as before after resharding,
but resharding is an one-off operation.
Boot time should be significantly reduced for nodes with a high smp
count and column family using leveled strategy (which can end up with
thousands of sstables).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180220032554.17776-1-raphaelsc@scylladb.com>
This patch takes a modified version of the Ubuntu 14.04 housekeeping
service script and uses it in Docker to validate the current version.
To disable the version validation, pass the --disable-version-check flag
when running the container.
Message-Id: <20180220161231.1630-1-amnon@scylladb.com>
With the changes introduced in #2981 and #3189, the lifetime management
of the objects used by index_reader became more complicated.
This patchset addresses the immediate problems caused by lack of proper
handling.
The more holistic approach to this will take more time and is to be
implemented under #3220. The current fix, however, should be good
enought as a stop-gap solution.
* 'issues/3213/v3' of https://github.com/argenet/scylla:
Close promoted index streams when closing index_readers.
Support proper closing of prepended_input_stream.
Promoted index input streams must be explicitly closed when closing the
index_reader in order to ensure all the pending read-aheads are
completed.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"This series adds the GoogleCloudSnitch.
Fixes#1619"
* 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla:
config: uncomment/add the supported snitches description
tests: added gce_snitch_test
locator::gce_snitch: implementation of the GoogleCloudSnitch
locator::snitch_base: properly log the failure during the snitch startup
The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasising the TTLs by x100 should help avoid these
false positives.
Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
"This series adds an API to return the active repairs by their IDs.
After this series a call to:
curl -X GET --header "Accept: application/json" "http://localhost:10000/storage_service/active_repair/"
Will return an array with the ids of the active repairs.
Fixes#3193"
* 'amnon/get_active_repairs_v3' of github.com:scylladb/seastar-dev:
API: Add get active repair api
repair: Add a get_active_repairs function to return the active repair
stream_throughput_outbound_megabits_per_sec is not supported and is
found in the unsupported part of scylla.yaml.
This patch removes it from the supported part of the file.
Fixes#2876
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180219111421.30687-1-amnon@scylladb.com>
Operations on a append_challenged_posix_file_impl schedule asynchronous
operations when they are executed, which capture the file object. To
synchronize with them and prevent use-after-free, we need to call
close() before destroying the file.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217170556.27330-1-duarte@scylladb.com>
This test relied on task execution order to work correctly. Namely, it
relied on parent regions being reclaimed before child regions
(reclaiming is an asynchronous process started by a call to
start_reclaiming()). This order is necessary because child regions
don't know about parent regions when calculating the biggest region
that should be reclaimed.
We fix this by forcing the reclaim order.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217121655.26057-1-duarte@scylladb.com>
Operations on a segment's underlying append_challenged_posix_file_impl,
such as truncate(), schedule asynchronous operations when they are
executed, which capture the file object. To synchronize with them and
prevent use-after-free, we need to call close() and only delete the
segment and file when the returned future resolves.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216235754.24257-1-duarte@scylladb.com>
When shutting down the commitlog we try to block all new requests by
acquiring all available resources. We were, however, letting go of the
semaphore permits too early, before closing the gate and shutting down
the active segments.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216234826.24111-1-duarte@scylladb.com>
This series takes Scylla most of the way to supporting roles, and
eliminates old user-based code. All the old user-based CQL statements
and functionality should exist as they did before, except now they are
backed internally by roles.
While all the functionality for supporting roles should be present,
role-specific features like granting a role to another role still warn
as "unimplemented". This will continue until the next series addresses
the final touches. These remaining items are:
- A slightly revised CQL syntax consistent with Apache Cassandra's
revised role syntax.
- A user is automatically granted permissions on resources they create.
Users running a previous version of Scylla should be able to seamlessly
upgrade to a version of Scylla with this series merged. When a newly
upgraded node starts, it detects the presence of old metadata and copies
it to the new metadata tables if no nondefault new metadata yet exists.
A new gossiper feature flag, ROLES, also ensures that access-control
data is not modified while a cluster is in a partially-upgraded state.
If, when the cluster is in a partially upgraded state, a client connects
to an un-upgraded node then likely the change will not be propogated to
the new metadata table. We will document that changes to access-control
are not supported while upgrading in order to account for both cases
(a client connecting to an upgraded and a non-upgraded node).
All unit tests pass (except those which also fail on `master`).
I've run auth-related dtests and they all pass, except for tests which
depend on the old security model and which are therefore invalid.
Upstream dtests have been updated to account for this new security model,
and I will open an appropriate pull request to to similarly update our
own version.
I have also done a test-run cluster upgrade procedure with ccm
consisting of a 3 node cluster. I began by creating the cluster from
`master` and increasing the replication factor of the `system_auth`
keyspace to 3 and repairing the nodes. I then created several users and
granted them permissions on some resources. I then stopped a node,
updated its hardlinked executable to Scylla built from this patch series
, and restarted the node. I observed the migration of legacy data
starting and finishing. Connecting to the node, I observed all the new
roles functionality was working correctly. I verified that attempting to
change access-control information failed with a message about an
upgrading cluster. I repeated the process, node by node, with the
remaining two nodes and finally observed that the entire cluster had
upgraded and that I could modify access-control information freely. I
will encapsulate this test into a dtest if possible.
Fixes#1941.
* 'jhk/switch_to_roles/v6' of https://github.com/hakuch/scylla: (83 commits)
cql3: Remove some unimplemented warnings
cql3: Prevent unhandled exception for anonymous user
auth: Add alias for set of role names
auth: Revoke permissions on dropped role resources
auth: Move definition to corresponding .cc file
cql3: Fix life-time of `user` from `client_state`
auth: Migrate legacy data on boot
auth: Check protected resources of the role-manager
auth: Protect authenticator resources
service/client_state: Correct erroneous comment
client_state: Fix error message
cql3: Fix error handling for GRANT and REVOKE
auth: Remove unnecessary `sstring` allocation
cql3: Rename variables to reflect roles
auth: Decouple authorization and role management
auth: Add code to expand a resource family
cql: Also add `username` col. for LIST PERMISSIONS
cql3: Fix error handling in LIST PERMISSIONS
auth: Change error messages to pass dtests
cql3: Handle errors more precisely for roles
...
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted realtive to the merged view, not
within a version, so evicting from some of the versions may mark
reanges as continuous when before they were discontinuous. Also, range
tombtsones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.
The fix is to revert back to full eviciton, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.
Fixes#3215.
Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
"Fixes two issues:
- update may abort if allocation of an empty partition_version fails
- LSA region construction is not exception safe, it may leave the misconstructed
region registered if allocation inside region_group::add() fails."
* tag 'tgrabiec/exception-safety-cache-update-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add test for exception safety of updates from memtable
tests: flat_reader_assertions: Improve failure message
cache: Handle exceptions from make_evictable()
tests: Disable failure injection around background compactor
lsa: Disable allocation failure injection inside merge()
lsa: Make region deregistration robust against duplicates
lsa: Make region allocation exception safe
While there are some small remaining features for roles, all the old
user-based statements still exist as they did before (except now they're
backed by roles) and should not log warnings.
Previously, when a table or keyspace was dropped, the
authorizer (through a `migration_listener`) automatically dropped all
permissions granted on that resource.
Likewise, when a role is granted permissions and the role is dropped,
all permissions granted to the role are dropped.
In this change, we now treat role resources just like table and keyspace
resources: if a permission is granted on a role (like "GRANT AUTHORIZE
ON ROLE qa TO phil") and the "qa" role is dropped, then all permissions
on the "qa" role resource are also dropped.
This change allows for seamless migration of the legacy users metadata
to the new role-based metadata tables. This process is summarized in
`docs/migrating-from-users-to-roles.md`.
In general, if any nondefault metadata exists in the new tables, then
no migration happens. If, in this case, legacy metadata still exists
then a warning is written to the log.
If no nondefault metadata exists in the new tables and the legacy tables
exist, then each node will copy the data from the legacy tables to the
new tables, performing transformations as necessary. An informational
message is written to the log when the migration process starts, and
when the process ends. During the process of copying, data is
overwritten so that multiple nodes racing to migrate data do not
conflict.
Since Apache Cassandra's auth. schema uses the same table for managing
roles and authentication information, some useful functions in
`roles-metadata.hh` have been added to avoid code duplication.
Because a superuser should be able to drop the legacy users tables from
`system_auth` once the cluster has migrated to roles and is functioning
correctly, we remove the restriction on altering anything in the
"system_auth" keyspace. Individual tables in `system_auth` are still
protected later in the function.
When a cluster is upgrading from one that does not support roles to one
that does, some nodes will be running old code which accesses old
metadata and some will be running new code which access new metadata.
With the help of the gossiper `feature` mechanism, clients connecting to
upgraded nodes will be notified (through code in the relevant CQL
statements) that modifications are not allowed until the entire cluster
has upgraded.
auth: Decouple authorization and role management
Access control in Scylla consists of three main modules: authentication,
authorization, and role-management.
Each of these modules is intended to be interchangeable with alternative
implementations. The `auth::service` class composes these modules
together to perform all access-control functionality, including caching.
This architecture implies two main properties of the individual
access-control modules:
- Independence of modules. An implementation of authentication should
have no dependence or knowledge of authorization or role-management,
for example.
- Simplicity of implementing the interface. Functionality that is common
to all implementations should not have to be duplicated in each
implementation. The abstract interface for a module should capture
only the differences between particular implementations.
Previously, the authorization interface depended on an instance of
`auth::service` for certain operations, since it required aggregation
over all the roles granted to a particular role or required checking if
a given role had superuser.
This change decouples authorization entirely from role-management: the
authorizer now manages only permissions granted directly to a role, and
not those inherited through other roles.
When a query needs to be authorized, `auth::service::get_permissions`
first uses the role manager to check if the role has superuser. Then, it
aggregates calls to `auth::authorizer::authorize` for each role granted
to the role (again, from the role-manager) to determine the sum-total
permission set. This information is cached for future queries.
This structure allows for easier error handling and
management (something I hope to improve in the future for both the
authorizer and authenticator interfaces), easier system testing, easier
implementation of the abstract interfaces, and clearer system
boundaries (so the code is easier to grok).
Some authorizers, like the "TransitionalAuthorizer", grant permissions
to anonymous users. Therefore, we could not unconditionally authorize an
empty permission set in `auth::service` for anonymous users. To account
for this, the interface of the authorizer has changed to accept an
optional name in `authorize`.
One additional notable change to the authorizer is the
`auth::authorizer::list`: previously, the filtering happened at the CQL
query layer and depended on the roles granted to the role in question.
I've changed the function to simply query for all roles and I do the
filtering in `auth::system` in-memory with the STL. This was necessary
to allow the authorizer to be decoupled from role-management. This
function is only called for LIST PERMISSIONS (so performance is not a
concern), and it significantly reduces demand on the implementation.
Finally, we unconditionally create a user in `cql_test_env` since
authorization requires its existence.
the value for the `role` column is equal to the value for the `username`
column.
This change makes LIST PERMISSIONS backwards compatible with clients
that expect the `username` column to exist. This functionality also
exists in Apache Cassandra.
This patch replaces duplicated code for checking the existence of a user
with the same mechanism for doing so as elsewhere: by checking for
`auth::nonexistent_role` being thrown during the course of checking
access-control.
This patch also ensures that exceptions thrown while querying the list
of permissions on a resource get handled correctly.
The fixed dtests which only failed due to differences in wording and
grammar for error messages are:
- altering_nonexistent_user_throws_exception_test
- cant_create_existing_user_test
- dropping_nonexistent_user_throws_exception_test
- users_cant_alter_their_superuser_status_test
This patch ensures that all the CQL statements for managing roles
correctly catch exceptions in the underlying `role_manager` and re-throw
them as top-level exceptions (like "invalid request").
This patch also refines exception handling so that only the applicable
errors are explicitly caught. This should allow easier auditing in the
future and help to reveal faulty assumptions.
Previously, a "data" auth. resource knew how to check it's own existence by
accessing a global variable.
This patch accomplishes two things: it adds existence checking to all
kinds of resources, and moves these checks outside of `auth::resource`
itself and into `auth::service` (so that global variables are no longer
accessed).
According to the Seastar convention, a parameter passed to a function
taking a reference parameter must live for the duration of the execution
of the returned future.
When possible, variables are statically allocated. When this is not
possible, we use `do_with`.
When a user executes GRANT or REVOKE, Scylla ensures that they
themselves are granted the permissions they are changing.
The code previously checked a static list of permissions, which we could
have replaced with `auth::permissions::ALL`. Even better, we now expand
the set of filtered permissions into an iterable container.
Sometimes it is useful to be able to query for all the members of an
`enum_set`, rather than just add, remove, and query for membership. (The
patch following this one makes use of this in the auth. sub-system).
We use the bitset iterator in Seastar to help with the implementation.
`super_enum::valid_is_valid_sequence` determines if the numeric index
corresponding to an enumeration value is valid. This is important,
because it is undefined behavior to cast an invalid index into an
enumeration value.
This function is used to check the validity of the `enum_set` mask when
an `enum_set` is constructed in `enum_set::from_mask`. If the mask has
set bits that correspond to invalid enumeration indicies, then we throw
`bad_enum_set_mask`.
This has the dual benefit of not enforcing copying on implementations of
the abstract interface and also limiting unnecessary copies.
As usual with Seastar, we follow the convention that a reference
parameter to a function is assumed valid for the duration of the
`future` that is returned. `do_with` helps here.
By adding some constants for root resources, we can avoid using
`seastar::do_with` at some call-sites involving `resource` instances.
All authorization checking lives in the CQL layer. The individual
authenticator, authorizer, and role-manager enforce no access-checks.
It may be a good idea to move these checks a level downward in the
future for ease of testing, but for now we aim for consistency.
While it's undefined behavior to pass an unsupported option to a
specific authenticator directly, the `auth::service` layer will check
options and throw this exception. It is turned into a
`invalid_request_exception` by the CQL layer.
The motivation behind this change is the idea that constructing a new
instance of an object is the job of the constructor.
One big benefit of this structure (with the addition of helpers for
convenience) is that calls for emplacing instances (like
`std::make_shared`, or `std::vector::emplace_back`) work without any
difficulty. This would not be true for static construction functions.
According to previous discussions on the mailing-list with Avi, using
both has the benefits of making virtual functions stand out and also
warning about functions which unintentionally do not override.
All we require are value semantics.
`client_state` still stores `authenticated_user` in a `shared_ptr`, but
the behavior of that class is complex enough to warrant its own
discussion/design/refactor.
The most important change is replacing `auth::authenticated_user::name`
with a public `std::optional<sstring>` member. Anonymous users have no
name. This replaces the insecure and bug-prone special-string of
"anonymous" for anonymous users, which does unfortunate things with the
authorizer.
The new `auth::is_anonymous` function exists for convenience since
checking the absence of a `std::optional` value can be tedious.
When a caller really wants a name unconditionally, a new stream output
function is also available.
Checking if the role to be dropped has superuser requires that the role
exists, which means `auth::nonexistent_role` was thrown even when IF
EXISTS was specified.
This is a large change, but it's a necessary evil.
This change brings us to a minimally-functional implementation of roles.
There are many additional changes that are necessary, including refined
grammar, bug fixes, code hygiene, and internal code structure changes.
In the interest of keeping this patch somewhat read-able, those changes
will come in subsequent patches. Until that time, roles are still marked
"unimplemented".
IMPORTANT: This code does not include any mechanism for transitioning a
cluster from user-based access-control to role-based access control. All
existing access-control metadata will be ignored (though not deleted).
Specific changes:
- All user-specific CQL statements now delegate to their roles
equivalent. The statements are effectively the same, but CREATE USER
will include LOGIN automatically. Also, LIST USERS only lists roles
with LOGIN.
- A call to LIST PERMISSIONS will now also list permissions of roles
that have been granted to the caller, in addition to permissions which
have been granted directly.
- Much of the logic of creating, altering, and deleting roles has been
moved to `auth::service`, since these operations require cooperation
between the authenticator, authorizer, and role-manager.
- LIST USERS actually works as expected now (fixes#2968).
The previous code has an off-by-one error since the iterator is
incremented unconditionally prior to being compared to the end of the
collection.
This new version is also shorter thanks to `seastar::do_until`.
The components of access-control (authentication, authorization, and
role-management) are designed as abstract interfaces, but due to
decisions of Apache Cassandra, certain implementations are dependent on
other particular implementations.
This change throws a new exception,
`auth::incompatible_module_combination`, when a dependency is not
satisfied.
The set of allowed options is quite small, so we benefit from a static
representation (member variables) over a dynamic map.
We also logically move the "OPTIONS" option to the domain of the
authenticator (from user management), since this is where it is applied.
This refactor also aims to reduce compilation time by moving
`authentication_options` into its own header file.
While changes to `user_options` were necessary to accommodate the new
structure, that class will be deprecated shortly in the switch to roles.
Therefore, the changes are strictly temporary.
cache_entry constructor was marked noexcept, yet make_evictable() may
fail in rare cases due to allocation in add_version(). Lift the
annotation and make sure that construction has strong exception
guarantees for the moved-in state so that it can be retried without
data loss inside allocating section.
Failure could be injected into the compactor if the main code under
test defers before reaching allocation failure point, and compactor
gets hit. This is not what the test is supposed to stress, and it
causes abort when memtable_snapshot_source is destroyed, so disable
failure injection there.
This patch adds a function that returns an array with the ids of the
active repairs by filtering the RUNNING ones in the repair tracker status.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When the cluster is large or the num_tokens is big, calculate_pending_ranges
can take long time to complete. It now runs in the gossip thread so it can
block the gossip processing. Another problem is it runs in a plain for loop and
can cause the reactor stall.
User see this stall with decommission operations.
I can reproduce up to 4 seconds stall within a two-node cluster each with
`--num-tokens 3072` during decommission.
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout
Fixes#3203
* tag 'asias/issue_3203_v2.1' of github.com:scylladb/seastar-dev:
storage_service: Do not wait for update_pending_ranges in handle_state_leaving
token_metadata: Handle affected_ranges with do_for_each
token_metadata: Split token_metadata::calculate_pending_ranges
token_metadata: Futurize calculate_pending_ranges
storage_service: Futurize storage_service::do_update_pending_ranges
token_metadata: Speed up token_metadata::get_endpoint
The call chain is:
storage_service::on_change() -> storage_service::handle_state_leaving()
-> storage_service::update_pending_ranges()
Listeners run as part of gossip message processing, which is
serialized. This means we won't be processing any gossip messages until
update_pending_ranges completes. update_pending_ranges takes time to
complete.
Since we do not wait for update_pending_ranges to complete any more,
multiple update_pending_ranges operations can run at the same time, use
serialized_action to serialize it.
Tested with update_cluster_layout_tests.py
affected_ranges can be very large in a large cluster or node with big
num_tokens account. calculate_natural_endpoints takes more time to
process in this case as well.
Futurize calculate_pending_ranges_for_leaving and handle the loop with
do_for_each to give some time for the reactor to breath, so it does not
block.
token_metadata::calculate_pending_ranges is a complicated function.
Split it into 3 parts for leaving operation, moving opeartion,
bootstrap opeartion.
Now, do_update_pending_ranges is futurized. We can finally futurize
token_metadata::calculate_pending_ranges in order to convert the loops
inside it to do_for_each insead of plain for loops to avoid reactor
stall.
Preparation work for the futurizing of the time consuming
token_metadata::calculate_pending_ranges.
In addition, we use do_for_each for the loop. It is better than the
plain for loop because the reactor can yield to avoid stalls in cases
there are tons of keyspaces.
The token_to_endpoint map can get big that trying to convert it to a
vector will cause large allocation warning.
This patch replace the implementation, so the return json array will be
created directly from the map by using stream_range_as_array helper
function.
Fixes#3185
Message-Id: <20180207153306.30921-1-amnon@scylladb.com>
Container indices are size_t, and in other places we gratuituously
declare a limit as unsigned and the loop index as signed.
Tests: unit (release)
Message-Id: <20180212121642.10525-1-avi@scylladb.com>
All of the adjustments to _remain already ensure it is greater than 0,
and indeed a negative _remain doesn't make sense.
Switching to an unsigne types allows us to re-enable -Wsign-compare.
Tests: unit (release)
Message-Id: <20180212121636.10463-1-avi@scylladb.com>
Commit cce1a2bce8 ("Use the CPU scheduler")
placed some compaction manager code in a scheduling_group. Unfortunately,
downstream code relied on the callers not deferring, so it can rely
on the column_family's existence. That doesn't happen if the column_family
is removed quickly, as with_scheduling_group() always defers.
Fix applying the scheduling group after we've taken the lock and guaranteed
the stability of the column_family object.
Fixes#3196.
Message-Id: <20180211165155.18179-1-avi@scylladb.com>
71495691aa removed sstable::get_index_reader(),
but forgot to update its callers in tests/. Update the callers to construct
a temporary shared_index_list and create the index_reader directly.
This is none too clean, but shared_index_lists needs to be retired, and then
the changes in this patch can go away too.
Tests: unit (release)
Message-Id: <20180211164739.17862-1-avi@scylladb.com>
With the changes introduced in #2981, it is no longer safe to share
index_entries among multiple sstable_mutation_readers.
The original intent behind sharing index_entries among index_readers was
to avoid re-reading same pages twice as we have two index readers -
lower and upper bound - for every sstable_mutation_reader. In fact, the
shared entries were held at the sstable object level so index_readers
from different sstable_mutation_readers could have accessed them.
Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos),
index_entry can be accessed in a way that modifies its state if we need
to read more promoted index blocks. It is safe to keep sharing them
between two index_readers within the same sstable_mutation_reader as the
invariant is maintained that readers can be only moved forward.
We cannot safely assume, however, that this invariant holds for multiple
sstable_mutation_readers as it may happen that one of them has read and
thrown away some promoted index blocks that another one needs. So we
restrict sharing to per-sstable_mutation_reader level.
Fixes#3189.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>
"The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's not longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.
Tests:
- unit: release mode
- dtest: resharding_test.py"
* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
Remove SSTable's atomic deletion manager
Stop using SSTable's atomic deletion manager
database: split column_family::rebuild_sstable_list
"We have noticed in the past that the compiler is too conservative when it comes
to deciding which functions to inline. Since inlining functions enables further
optimisations such as const folding in some cases the difference in performance
was significant enough to force us to add [[gnu::always_inline]] attribute in
numerous places. However, this is neither a partical nor an elegant solution.
A better way to deal with the problem is to adjust the compiler tunables that
control the heuristics used for making inlining decisions. In particular,
inline-unit-growth seems to affect the performance of the emitted code most.
Apart from making the compiler more eager to inline functions bumping the
optimisation level to -O3 also seems to have a positive impact on the
performance.
Fixes#1644.
Tests: unit-test (release)
Performance tested with gcc 7.3.
Macrobenchmark
perf_simple_query
Flags: -c4 --duration 60
All results are medians.
./before ./after diff
read 338662.12 405377.80 19.7%
write 387378.89 466744.15 20.5%
Microbenchmarks
single run duration: 1.000s
number of runs: 5
BEFORE
test iterations median mad min max
combined.one_row 858933 536.389ns 0.819ns 534.823ns 537.208ns
combined.single_active 8469 77.131us 11.000ns 77.118us 77.145us
combined.many_overlapping 1199 664.105us 160.807ns 663.818us 668.527us
combined.disjoint_interleaved 8100 75.522us 22.254ns 75.500us 75.732us
combined.disjoint_ranges 8288 72.580us 10.571ns 72.568us 72.599us
memtable.one_partition_one_row 1216233 825.581ns 0.446ns 821.450ns 826.027ns
memtable.one_partition_many_rows 127336 7.855us 2.153ns 7.853us 7.898us
memtable.many_partitions_one_row 57919 17.356us 6.028ns 17.259us 17.362us
memtable.many_partitions_many_rows 4751 210.496us 102.339ns 210.393us 211.188us
AFTER
test iterations median mad min max
combined.one_row 1002321 450.292ns 0.313ns 447.202ns 450.605ns
combined.single_active 9605 67.086us 8.620ns 67.073us 67.115us
combined.many_overlapping 1476 519.554us 5.334ns 519.549us 519.953us
combined.disjoint_interleaved 9280 64.363us 5.328ns 64.335us 64.369us
combined.disjoint_ranges 9481 61.893us 3.620ns 61.885us 61.903us
memtable.one_partition_one_row 1432668 699.775ns 0.106ns 696.023ns 699.918ns
memtable.one_partition_many_rows 153692 6.536us 6.885ns 6.501us 6.543us
memtable.many_partitions_one_row 63319 15.879us 5.080ns 15.793us 15.884us
memtable.many_partitions_many_rows 5659 176.717us 66.770ns 176.650us 177.778us"
* tag 'optimise-and-inline/v2' of https://github.com/pdziepak/scylla:
configure.py: set optimisation level to -O3
configure.py: set inline-unit-growth to 300
configure.py: flag_supported: support flags with spaces
configure.py: rename warning_supported to flag_supported
configure.py: pass optimisation flags to seastar/configure.py
cql3/select_statement: do not capture stack variables by reference
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
in his behalf. I've done a ton of testing in the series and there are
some improvements / changes that I had previously sent as a separate series.
What you see here is the result of merging that work.
After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.
We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.
* git@github.com:glommer/scylla.git cpusched-v7:
Avi Kivity (4):
database, sstables, compaction: convert use of thread_scheduling_group
to seastar cpu scheduler
memtable, database: make memtable::clear_gently() inherit
scheduling_group
config: mark background_writer_scheduling_quota as Unused
database: place data_query execution stage into scheduling_group
Glauber Costa (9):
database, main: set up scheduling_groups for our main tasks
row_cache: actually use the scheduling group for update_cache
allow update_cache and clear_gently to use the entire task quota.
database: remove cpu_flush_quota metric
controllers: retire auto_adjust_flush_quota
controllers: allow memtable I/O controller to have shares statically
set
controllers: update control points for memtable I/O controller
controllers: allow a static priority to override the controller output
controllers: unify the I/O and CPU controllers
It has been discovered that the compiler is too conservative when
deciding which functions to inline. In particular, the limiting tunable
turned out to be inline-unit-growth which limits inlining in large
translation units.
* seastar 6d02263...2b0a81d (7):
> configure.py: add -Wno-stringop-overflow
> configure.py: add --optflags for specifying optimisation flags
> build: add protobuf-compiler to docker dev image
> build: update docker builder to newer Fedora
> json_element: stream_object to get its parameter by value
> json_element: stream range object
> build: add yaml-cpp-devel installation to Dockerfile
The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that resharding will not want the code that is
specific to regular compaction after atomic deletion is removed.
Resharding will eventually only need to replace old tables with
new ones, and it will be in charge of deletion of old tables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.
Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one. We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.
In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it we should allow a static value to override the
controller's output as well.
That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now CPU and I/O controllers have slightly different control points
for no good reason. Let's use the CPU controller ones as the standard, as
we have been using it in the field for longer and trust it more.
The end goal is to fully integrate them.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It no longer makes sense now that we have the full scheduler +
controllers. In its lieu, we will provide an option to statically set
the controller's shares as a safe guard against us getting this wrong.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have had a quota of partitions to process in clear_gently /
update_cache, so that we don't overwork. However, with those things now
being in their own task group there is no harm in allowing it to run
until we reach a natural preemption point.
While we are at it, clear_gently did not check for need_preempt()
before, so this patch fixes it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have moved clear_gently from using a seastar::thread's scheduling_group to
using the CPU scheduler's. However, update_cache was forgotten.
This patch fixes that and gets rid of the old group just in case.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Because execution stages defer and batch processing of the function
they run, they escape their fiber's context and therefore the
scheduling group.
Fix (for data_query) by initializing the execution_stage with the
query scheduling_group. To do that we have to move the execution
stage into the database object, so it has access to the scheduling
group during initialization.
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.
The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.
Comments from Glauber:
This patch is based in a patch from Avi with the same subject, but the
differences are signficant enough so that I reset authorship. In
particular:
1) A bug/regression is fixed with the boundary calculations for the
memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
themselves are run in the compaction scheduling group. Having that
working is what changes this patch the most: we now store the
scheduling group in the compaction manager and let the compaction
manager itself enforce the scheduling group.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.
The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.
Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
The SSTable tests are a bit fragile now because they rely on min_threshold
having a particular value. That is the default value, but if I change that
default - which I am planning to do - the test breaks.
Right now the test is not broken, but if we are planning on relying on a
property having a particular value in tests, we should explicitly set it.
So I am proactively chaning min_threshold in the tests to have the value
of 4 explicitly, so we can change that in the future without breaking anything.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180207155513.12498-1-glauber@scylladb.com>
Since we unconditionally running blkdiscard on disks, we may get ioctl error
message on some disks which does not support TRIM.
This can be ignore but it's bad UX, so let's skip running blkdiscard when TRIM
is not supported on the disk.
Fixes#2774
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517992904-13838-1-git-send-email-syuu@scylladb.com>
Parses the extension map in tables/views using the registered extension.
If a schema row contains an unknown extension, we just preserve the data
in a placeholder.
Automatically accept registered schema extensions into the properties
set, and when building, generate the corresponding extension object into
the resulting schema.
Requires "workaround" fix for schema_registry and frozen_mutation, since
the former is a free-float thread local, and the latter is a pure data
carrier. frozen_schema can take a parameter for unfreeze, but schema
registry requires being told which the system extensions are.
Move the configurables to init so tests can link this as well.
Add extensions object to db config in main and provide to
configurables. These can then add extensions at this phase.
The idea being that we should have config be a global, immutable
singleton, set up by startup/test then owned/referenced by db etc.
Extensions are read-only in this context, so init code should set it up
before handing to the config. Or keep a ref to the ext param.
Make a "compressor" an actual class, that can be implemented and
registered via class registry.
For "common" compressors, the objects will be shared, but complex
implementors can be semi-stateful.
sstable compression is split into two parts: The "static" config
which is shared across shards, and a "local" one, which holds
a compressor pointer. The latter is encapsulated, along with
actual compressed data writers, in sstables/compress.cc.
For compression (write), compression writer is instansiated
with the settings active in table metadata.
For decompression (read), compression reader is instansiated
with the settings stored in sstable metadata, which can
differ from the currently active table metadata.
v2:
* Structured patch sets differently (dependencies)
* Added more comments/api descs
* Added patch to move all sstable compression into compress.cc,
effectively separating top-level virtual compressor object
from sstable io knowledge
v3:
* Rebased
v4:
* Moved all sstable compression logic/knowledge into
compress.cc (local compression). Merged the two patches
(separation just confuses reader).
Fix clustering column indexing by lifting the limitation of only
considering non-primary key restrictions in
select_statement::find_index_partition_ranges().
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.
Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.
Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.
Introduced in ca8e3c4, so affects 2.1+
Fixes#3186.
Tests: unit (release)"
* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
tests: mvcc: Add test for eviction with non-evictable snapshots
mutation_partition: Define + operator on tombstones
tests: mvcc: Check that partition is fully discontinuous after eviction
tests: row_cache: Add test for memtable readers surviving flush and eviction
memtable: Make printable
mvcc: Take partition_entry by const ref in operator<<()
mvcc: Do not evict from non-evictable snapshots
mvcc: Drop unnecessary assignment to partition_snapshot::_version
tests: Use partition_entry::make_evictable() where appropriate
mvcc: Encapsulate construction of evictable entries
When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.
Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.
Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry.
Introduced in ca8e3c4, so affects 2.1+
Fixes#3186.
merge_partition_versions() is responsible for merging versions
unpinned by the current snapshot. If that fails, we don't need to set
_version back since versions must be still referenced by someone else,
this snapshot is not a unique owner.
This change makes it easier to add tracking of evictability.
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.
So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.
We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and feeded to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.
Tests: release mode.
Fixes#3148.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
Uncomment desscriptions of Ec2SnitchXXX which are supported for a long
time already.
Add the description of the new GoogleCloudSnitch.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Tests the GoogleCloudSnitch.
Uses the dummy GCE meta server that would be listening on 127.0.0.1:80 by default.
To change the IP of the dummy server one can use the DUMMY_META_SERVER_IP
environment macro.
To use the real GCE meta server (from inside the GCE VM) one should define
the USE_GCE_META_SERVER environment macro.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This is a snitch that should be used when Scylla runs in GCE VMs in both
single and multi data center (DC) configurations.
This snitch interacts with the GCE (instance metadata) API as
described here: https://cloud.google.com/compute/docs/storing-retrieving-metadata)
similarly to how ec2_snitchXXX interacts with the AWS API.
However unlike ec2_multi_region_snitch the GCE snitch only gets the instance's zone and sets
the DC and the RACK based on it, e.g. for us-central1-a the DC is set to 'us-central'
and the RACK - to 'a'.
GCE snitch doesn't have to learn the internal and external IPs of the instance because in
GCE instances from different regions can interact using internal IPs (in the AWS they can't).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* seastar 21badbd...6d02263 (4):
> build: detect name of ninja executable
> queue: pop_eventually/push_eventually should throw when called after abort
> build: compile libfmt out-of-line
> core/gate: Ensure with_gate leaves gate on exception
We uses AmbientCapabilities directive on systemd unit, but it does not work
on older kernel, causes following error:
"systemd[5370]: Failed at step CAPABILITIES spawning /usr/bin/scylla: Invalid argument"
It only works on kernel-3.10.0-514 == CentOS7.3 or later, block installing rpm
to prevent the error.
Fixes#3176
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517822764-2684-1-git-send-email-syuu@scylladb.com>
* seastar 19efbd9...21badbd (4):
> reactor: change adjustment method for tasks becoming active
> Merge 'Update ARM port' from Avi
> http: Do not wait for close connection on stop if listen did not completed
> core/future-util: Don't allow rvalues in do_for_each()
cql_query_test contains many continuations that are generic lambdas:
foo().then([] (auto x) { ... })
These templates prevent Eclipse's indexer from inferring the type of x,
and so everything below that point is one big error as far as Eclipse is
concerned.
De-template these lambdas by specifying the real types.
Unfortunately, compile time decrease was not observed.
Tests: cql_query_test (release)
Message-Id: <20180204113503.23297-1-avi@scylladb.com>
This patch adds a enging().on_exit cleanup for the prometheus server,
similar to other components in the system.
It will stop the server when sutting down.
Fixes#2520
Message-Id: <20180201132647.17638-1-amnon@scylladb.com>
These patches deal with the remaining exception safety issues in the
memtable partition range readers. That includes moving the assignment
to iterator_reader::_last outside of allocating section to avoid
problems caused by exception-unsafe assignment operator. Memory
accotuning code is also moved out of the retryable context to improve
the code robustness and avoid potential problems in the future.
Fixes#3172.
Tests: unit-test (release)
* https://github.com/pdziepak/scylla.git memtable-range-read-exception-safety/v1:
memtable: do not update iterator_reader::_last in alloc section
memtable: do not change accounting state in alloc section
tests/memtable: add more reader exception safety tests
Even if shared_ptr is const it doesn't mean that its internal state is
immutable and it still cannot be freely shared across shards.
Fixes assertion failure in build/debug/tests/cql_roles_query_test.
Message-Id: <20180201125221.30531-1-pdziepak@scylladb.com>
Shared pointer don't like being shared across shards.
Fixes assertion failure in build/debug/tests/mutation_reader_test.
Message-Id: <20180201125017.30259-1-pdziepak@scylladb.com>
Set the option that enables the underlying memtable and cache readers
to request caching of a cell's hash, for requests that require a
digest.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We add a cluster feature that informs whether the xxHash algorithm is
supported, and allow nodes to switch to it. We use a cluster feature
because older versions are not ready to receive a different digest
algorithm than MD5 when answering a data request.
If we ever should add a new hash algorithm, we would also need to
add a new cluster feature for that algorithm. The alternative would be
to add code so a coordinator could negotiate what digest algorithm to
use with the set of replicas it is contacting.
Fixes#2884
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
While not strictly needed, specify which algorithm to use when request
a digest from a remote node. This is more flexible than relying on a
cluster wide feature, although that's what we'll do in subsequent
patches. It also makes the verb more consistent with the data request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Introduce the digest_algorithm() function, which encapsulates the
decision of which digest algorithm to use. Right now it is set to MD5,
but future patches will change this.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When digest is requested, pre-calculate the cell's hash. We consider
the case when the cell is already in the cache, and the case when it
added by the underlying reader.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When digest is requested, pre-calculate the cell's hash. A downside of
this approach is that more work will be done when there are multiple
versions of a row that contain values for the same cell, but we expect
these cases to be rare and the upside of caching a cell's hash to
compensate for the extra work.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Having this option enables us to communicate from the upper to the
lower layers whether a digest was requested, so that we can pre-calculate
and cache a cell's hash in the readers that have access to the actual
in-memory cells (within the memtable and the row cache).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This entails doing the cell hash calculation slightly differently,
where the cell is hashed individually, the resulting hash being added
to the running one.
Instead of propagating a flag all through the call chain, we detect
whether we are in the new mode by the employed hash algorithm.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This enables us to only branch once per row on the actual hash
algorithm, instead of once per row data item.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We add storage to a row to hold the cached hashes of each individual
cell. We don't store the hash in each cell because that would a)
change the cell equality function, and b) require us to change a cell
in a potentially fragmented buffer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch forces the size of vector_storage's internal storage to 5,
meaning that the underlying managed_vector will ensure it doesn't need
to externally allocate a buffer to hold the row, if only its first 5
cells are set.
We define this size explicitly so we can change the vector's value
type in upcoming patches without affecting the optimization.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Replace the atomic_cell_or_collection::feed_hash() member function
with the specialization of appending_hash, and use that instead.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Use the digester class instead of md5_hasher to encapsulate the
decision of which hash algorithm to use.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Replace range_tombstone::feed_hash() with the specialization of
appending_hash, so that we can use the general feed_hash() function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Replace the feed_hash() member function of partition_key and
clustering_key_prefix with the specialization of appending_hash,
so that we can use the general feed_hash() function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Introduce class result_options to carry result options through the
request pipeline, which at this point mean the result type and the
digest algorithm. This class allows us to encapsulate the concrete
digest algorithm to use.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch paves the way for us to encapsulate the actual digest
algorithm used for a query. The digester class dispatches to a
concrete implementation based on the digest algorithm being used. It
wraps the xxHash algorithm to provide a 128 bit hash, which is the
size of digest expected by the inter-node protocol.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces xx_hasher, a class conforming to the Hasher
concept, which will be used to calculate the data digest in subsequent
patches. It is expected to be an order of magnitude faster than md5.
We use the 64 bit variant of the algorithm, the 128 bit one still
being under development.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series adds the base for the V2 Swagger definition file.
After the series, the definition file will be at:
http://localhost:10000/v2
It can be used with the swagger ui, by replacing the url in the search
path."
* 'amnon/swagger_20' of github.com:scylladb/seastar-dev:
Register the API V2 swagger file
Adding the header part of the swagger2.0 API
"Before this patch set, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, with two
different rows in the view table, instead of just one with the latest
data. In this series we add locking which serializes the two conflicting
updates, and solves this problem.
I explain in more detail why such locking is needed, and what kinds of
locks are needed, in the third patch."
* 'master' of https://github.com/nyh/scylla:
Materialized views: serialize read-modify-update of base table
Materialized views: test row_locker class
Materialized views: implement row and partition locking mechanism
These patches change the memtable reader implementation (in particular
partition_snapshot_reader) so that the existing exception safety
paroblems are fixed, but also in a way that, hopefully, would make it
easier to reason about the error handling and avoid future bugs in that
area.
The main difficulty related to exception safety is that when an
exception is thrown out of an allocating section that code is run again
with increased memory reserved. If the retryable code has side effects
it is very easy to get incorrect behaviour.
In addition to that, entering an allocating section is not exactly cheap
which encourages doing so rarely and having large sections.
The approach taken by this series is to, first, make entering allocating
sections cheaper and then reducing the amount of logic that runs inside
of them to a minimum.
This means that instead of entering a section once per a call to
flat_mutation_reader::fill_buffer() the allocation section is entered
once for each emitted row. The only state modified from within the
section are cached iterators to the current row, which are dropped on
retry. Hopefully, this would make the reader code easier to reason
about.
The optimisations to the allocating sections and managed_bytes
linearised context has successfully eliminated any penalty caused by
much more fine grained allocating sections.
Fixes#3123.
Fixes#3133.
Tests: unit-tests (release)
BEFORE
test iterations median mad min max
memtable.one_partition_one_row 1155362 869.139ns 0.282ns 868.465ns 873.253ns
memtable.one_partition_many_rows 127252 7.871us 15.252ns 7.851us 7.886us
memtable.many_partitions_one_row 58715 17.109us 2.765ns 17.013us 17.112us
memtable.many_partitions_many_rows 4839 206.717us 212.385ns 206.505us 207.448us
AFTER
test iterations median mad min max
memtable.one_partition_one_row 1194453 839.223ns 0.503ns 834.952ns 842.841ns
memtable.one_partition_many_rows 133785 7.477us 4.492ns 7.473us 7.507us
memtable.many_partitions_one_row 60267 16.680us 18.027ns 16.592us 16.700us
memtable.many_partitions_many_rows 4975 201.048us 144.929ns 200.822us 201.699us
./before_sq ./after_sq diff
read 337373.86 353694.24 4.8%
write 388759.99 394135.78 1.4%
* https://github.com/pdziepak/scylla.git memtable-exception-safety/v2:
tests/perf: add microbenchmarks for memtable reader
flat_mutation_reader: add allocation point in push_mutation_fragment
linearization_context: remove non-trivial operations from fast path
lsa: split alloc section into reserving and reclamation-disabled parts
lsa: optimise disabling reclamation and invalidation counter
mutation_fragment: allow creating clustering row in place
paratition_snapshot_reader: minimise amount of retryable code
memtable: drop memtable_entry::read()
tests/memtable: add test for reader exception safety
Retryable code that has side effects is a recipe for bugs. This patch
reworkds the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has a very limited side effects.
Moving clustering_row is expensive due to amount of data stored
internally. Adding a mutation_fragment constructor that builds a
clustering_row in-place saves some of that moving.
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.
However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.
In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
Allocating sections reserves certain amount of memory, then disables
reclamation and attempts to perform given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.
Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.
In order to reduce the performance penalty of such solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
Since linearization_context is thread_local every time it is accessed
the compiler needs to emit code that checks if it was already
constructed and does so if it wasn't. Moreover, upon leaving the context
from the outermost scope the map needs to be cleared.
All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that lineatization_context is trivially
constructible and the map is cleared only if it was modified.
Exception safety tests inject a failure at every allocation and verify
whether the error is handled properly.
push_mutation_fragment() adds a mutation fragment to a circular_buffer,
in theory any call to that function can result in a memory allocation,
but in practice that depends on the implementation details. In order to
improve the effectiveness of the exception safety tests this patch adds
an explicit allocation point in push_mutation_fragment().
"This patchset makes index_reader consume promoted index incrementally
on demand as the reader advances through the current partition instead
of storing the entire promoted index which can be huge.
When the current page is parsed, data for promoted indices are turned
into input streams that are only read and parsed if a particular
position within a partition is seeked for. This avoids potentially large
allocations for big partitions."
* 'issues/2981/v10' of https://github.com/argenet/scylla:
Use advance_past for single partition upper bound.
Remove obsolete types and methods.
Simplify continuous_data_consumer::consume_input() interface.
Parse promoted index entries lazily upon request rather than immediately.
Add helper input streams: buffer_input_stream and prepended_input_stream.
Support skipping over bytes from input stream in parsers based on continuous_data_consumer
Add performance tests for large partition slicing using clustering keys.
Before this patch, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows added to the view table, instead of just one with the latest
data. In this patch we we add locking which serializes the two conflicting
updates, and solves this problem. The locking for a single base-table
column_family is implemented by the row_locker class introduced in a
previous patch.
A long comment in the code of this patch explains in more detail why
this locking is needed, when, and what types of locks are needed: We
sometimes need to lock a single clustering row, sometimes an entire
partition, sometimes an exclusive lock and sometimes a shared lock.
Fixes#3168
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is a unit test for the row_locker facility. It tests various
combination of shared and exclusive locks on rows and on partitions,
some should succeed immediately and some should block.
This tests the row_locker's API only, it does not use or test anything
in Materialized Views.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a "row_locker" class providing locking (shard-locally) of
individual clustering rows or entire partitions, and both exclusive and
shared locks (a.k.a. reader/writer lock).
As we'll see in a following patch, we need this locking capability for
materialized views, to serialize the read-modify-update modifications
which involve the same rows or partitions.
The new row_locker is significantly different from the existing cell_locker.
The two main differences are that 1. row_locker also supports locking the
entire partition, not just individual rows (or cells in them), and that
2. row_locker supports also shared (reader) locks, not just exclusive locks.
For this reason we opted for a new implementation, instead of making large
modificiations to the existing cell_locker. And we put the source files
in the view/ directory, because row_locker's requirements are pretty
specific to the needs of materialized views.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.
When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.
Fixes#3068
Tests: unit-tests (release)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
* seastar 770c450...19efbd9 (3):
> configure.py: add --static-yaml-cpp option to link libyaml-cpp statically
> Merge 'Avoid kernel stalls due to fsync' from Avi
> rwlock: add exception-safe lock/unlock alternative
This patch adds a scripts/find-maintainer script, similar to
script/get_maintainer.pl in Linux, which looks up maintainers and
reviewers for a specific file from a MAINTAINERS file.
Example usage looks as follows:
$ ./scripts/find-maintainer cql3/statements/create_view_statement.cc
CQL QUERY LANGUAGE
Tomasz Grabiec <tgrabiec@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
MATERIALIZED VIEWS
Duarte Nunes <duarte@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
Nadav Har'El <nyh@scylladb.com> [reviewer]
Duarte Nunes <duarte@scylladb.com> [reviewer]
The main objective of this script is to make it easier for people to
find reviewers and maintainers for their patches.
Message-Id: <20180119075556.31441-1-penberg@scylladb.com>
Instead of advancing to the next partition, try first find the more
precise position using promoted index blocks.
advance_past() only seeks within currently available PI blocks (or reads
the first batch, if never read before) and uses the position if found,
otherwise resorts to advance_to_next_partition()
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
These types and methods are no longer in use since the index_reader is
now consuming promoted index incrementally.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Remove redundant input parameter as continuous_data_consumer derivatives
would only use themselves as a context. So take it internally and make
the function regular (non-template) and having no parameters.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Now promoted index is converted into an input_stream and skipped over
instead of being consumed immediately and stored as a single buffer.
The only part that is read right away is the deletion time as it is
likely to be there in the already read buffer and reading it should both
be cheap and prevent from reading the whole promoted index if only
deletion time mark is needed.
When accessed, promoted index is parsed in chunks, buffer by buffer, to
limit memory consumption.
Fixes#2981
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
buffer_input_stream is a simple input_stream wrapping a single
temporary_buffer.
prepended_input_stream suits for the case when some data has been read
into a buffer and the rest is still in a stream. It accepts a buffer and
a data_source and first reads from the buffer and then, when it ends,
proceeds reading from the data_source.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Currently we don't check data_file_directories existance before running iotune,
therefore it's shows unclear error message.
To make the message better, check the directory existance on scylla_io_setup.
Fixes#3137
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517200647-6347-1-git-send-email-syuu@scylladb.com>
* seastar d03896d...770c450 (10):
> tls_test: Fix echo test not setting server trust store
> tls: Do not restrict re-handshake to client
> tls: Actually verify client certificate if requested
> rwlock: add method for determining if an rwlock is locked
> metrics: Add missing `break` to metric_value::operator+()
> memory: fix error injector throwing from noexcept memory allocator functions
> systemwide_memory_barrier: don't use mprotect() on ARM
> sharded: Add const version of sharded::local()
> Add const overloads of front() and back() to the circular_buffer.
> Remove unused lambda captures
Fixes#3072
The exponential_backoff_retry instance is captured by move and is then
indirectly moved again as repeat_until_value() moves the lambda its
passed into its internal state. This caused problems as internal
lambdas store references to the instance and these references go stale
after the move.
To fix this keep hold of the existential_backoff_retry instance in an
enclosing do_with() to make it safe for internal lambdas to reference
it.
Indentation will be fixed by the next patch.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <adc49d25a6176756d60e092f3713c0c897732382.1517222195.git.bdenes@scylladb.com>
It tests mutation_from_streamed_mutation that is no longer
used and will be removed in the next patch.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It tests freeze(streamed_mutation) which is no longer used
and will be removed in the next patch.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The labels in database active_reads metrics where not define correctly.
Label should be created so it will be possible to select based on their
value.
The current implementation define a label "class" with three instances:
user, streaming, system.
Fixes: #2770
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180123125206.23660-1-amnon@scylladb.com>
When a buffer of a flat reader is small then the reader can't
handle range_tombstones correctly.
This is not a problem on a production when the buffer is large.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Before when the buffer was so small that it could fit only a single
range_tombstone, cache_flat_mutation_reader would keep returning
the same tombstone over and over again.
The fix is to set _lower_bound to the next fragment we want to return.
Fixes#3139
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
timeout parameter was captured by reference, and could be accessed out
of scope in case the repeat loop deferred.
Fixes debug-mode failure of flat_mutation_reader_test.
Message-Id: <1516699230-19545-1-git-send-email-tgrabiec@scylladb.com>
The reason sstable key estimation is inaccurate is that it doesn't account that
index sampling is now dynamic.
The estimation is done as follow:
uint64_t get_estimated_key_count() const {
return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) *
_components->summary.header.min_index_interval;
}
The biggest problem is that _components->summary.header.min_index_interval isn't
actually the minimum interval, but instead the default interval value set in the
schema.
So the estimation gets worse the larger the average partition, because the larger
the average partition the lower the index sampling interval.
One of the problems is that estimation has a big influence on bloom filter size,
and so for large partitions we were generating bigger filters than we had to.
From now on, size at full sampling is calculated as if sampling were static
(which was the case until commit 8726ee937d which introduced size-based
sampling), using minimum index as a strict sampling interval.
Tests: units (release)
Fixes#3113.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>
"It turned out that decimal numbers that were obtained as cast from integers
should always contain just one decimal place 0.
This can be recognised especially when calculating avg(.) over such numbers
because result contains just one decimal point.
Fixes #3111."
* 'danfiala/integers-to-decimal' of github.com:hagrid-the-developer/scylla:
tests: Add test that decimal obtained as CAST from integer always contain one decimal place.
types: Decimal that is obtained from integer always contain one decimal place.
this patch fixes the following remarks:
./defaults.py:2:9: E126 continuation line over-indented for hanging indent
./fake.py:15:1: E305 expected 2 blank lines after class or function definition, found 1
./livedata.py:49:17: F402 import 'metric' from line 5 shadowed by loop variable
./scyllatop.py:44:1: E305 expected 2 blank lines after class or function definition, found 1
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180119162939.17866-1-ultrabug@gentoo.org>
After the new compaction controller code, the monitor has to be kept
alive until the sstable is added to the SSTable set.
This is correctly handled for all the writers, except the streaming big.
That flusher is a big confusing, as it builds an sstable list first and
only later adds the elements in the list to the sstable set. The
monitors are destroyed at the end of phase 1, so we will SIGSEGV later
when calling add_sstable().
The fix for this is to make sure the lifetime of the monitors are tied
to the lifetime of the sstables being handled big the big streaming
flush process.
Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test
Fixes#3131
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180118202230.17107-1-glauber@scylladb.com>
This adds a registration of the V2 swagger file.
V2 uses the Swagger 2.0 format, the initial definitions is empty and can
be reached at:
http://localhost:10000/v2
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In Swagger 2.0 all the API is exported as a single file.
The header part of the file, contains general information. It is stored
as an external file so it will be easy to modify when needed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"Changes merging in MVCC to apply newer version to older instead of older to
newer.
Before (v0 = oldest):
(((v3 + v2) + v1) + v0)
After:
(v0 + (v1 + (v2 + v3)))
or:
(((v0 + v1) + v2) + v3)
There are several reasons to do this:
1) When continuity merging will change semantics to support eviction
from older versions, it will be easier to implement apply() if we
can assume that we merge newer to older instead of older to
newer, since newer version may have entries falling into a
continuous interval in older, but not the other way around. If we
didn't revert the order, apply() would have to keep track of
lower bound of a continuous interval in the right-hand side
argument (older version) as it is applied and update continuity
flags in the left hand side by scanning all entries overlapping
with it. If order is reversed, merging only needs to deal with
the current entry. Also, if we were to keep the old order, we
cannot simply move entries from the left hand side as we merge
because we need to keep track of the lower bound of a continuous
interval, and we need to provide monotonic exception
guarantees. So merging would be both more complicated and slower.
2) With large partitions older versions are typically larger than
newer versions, and since merging is O(N_right*(1 + log(N_left))),
it's better to merge newer into older.
This fixes latency spikes seen in perf_cache_eviction.
Fixes #2715."
* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
mvcc: Reverse order of version merging
anchorless_list: Introduce last()
mvcc: Implement partition_entry::upgrade() using squashed()
mvcc: Extract version merging functions
mutation_partition: Add rows_entry::set_dummy()
position_in_partition: Introduce after_key()
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
The following code
struct S
{
S(int i = 42);
};
void f()
{
S( {} );
}
produces this assembly with g++ --std=c++14
lea rax, [rbp-1]
mov esi, 0
mov rdi, rax
call S::S(int)
and this one with g++ --std=c++17
lea rax, [rbp-1]
mov esi, 42
mov rdi, rax
call S::S(int)
For more details about compiler bug, check:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83937
NOTE: clang isn't affected by it.
Test relied on braced initialization of compressor (an enum class)
working properly when used as argument to compression_parameters's
ctor. Braced-initilization of an integer based type should be zero,
but default argument (lz4) was used instead, which means compression
was enabled when it shouldn't.
The course of action is to workaround the bug by explicitly setting
compressor type to none.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180119013655.32564-1-raphaelsc@scylladb.com>
Change merging to apply newer version to older instead of older to
newer.
Before:
(((v3 + v2) + v1) + v0)
After:
(v0 + (v1 + (v2 + v3)))
or equivalent:
(((v0 + v1) + v2) + v3)
There are several reasons to do this:
1) When continuity merging will change semantics to support eviction
from older versions, it will be easier to implement apply() if we
can assume that we merge newer to older instead of older to
newer, since newer version may have entries falling into a
continuous interval in older, but not the other way around. If we
didn't revert the order, apply() would have to keep track of
lower bound of a continuous interval in the right-hand side
argument (older version) as it is applied and update continuity
flags in the left hand side by scanning all entries overlapping
with it. If order is reversed, merging only needs to deal with
the current entry. Also, if we were to keep the old order, we
cannot simply move entries from the left hand side as we merge
because we need to keep track of the lower bound of a continuous
interval, and we need to provide monotonic exception
guarantees. So merging would be both more complicated and slower.
2) With large partitions older versions are typically larger than
newer versions, and since merging is O(N_right*(1 + log(N_left))),
it's better to merge newer into older.
Fixes#2715.
This patch changes single_column_primary_key_restrictions::values() to
return values obtained via components() instead of the serialized form
that's returned by representation(). We need this to turn clustering key
restriction keys into partition keys for clustering key indexed queries.
Raphael recently caught this test failing. I can't really reproduce it,
but it seems to me that it is a timing issue: we execute two different
statements, each one should timeout after 10ms. After 20ms, we make sure
that they both timed out.
They don't (in his system), which is explained by the fact that we are
no longer using high resolution clocks for the timeouts. Expirations for
lowres clocks will only happen at every 10ms, and in the worst case we
will miss twoa.
So the fix I am proposing here is to just account for potential
innacuracies in the clocks and calculations by waiting a bit longer.
Ideally, we would use the manual clock for this. But in this case, this
would mean adding template parameters to pretty much all of the
mutation_reader path.
Currently, not only the test failed, it also had an use-after-free
SIGSEGV. That happens because we give up on the reader while the
timeouts is still to happen.
It is the caller responsibility to ensure the lifetime of the reader is
correct. Dealing with that cleanly would require a cancelation mechanism
that we don't have, so we'll just add an assertion that will fail more
gracefully than the SIGSEGV.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are not propagating timeouts to fast_forward_to in the
mutation_reader_test. This is not currently causing any issue, but I
noticed it while chasing one - so let's fix it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This will make boost UTF abort execution on SIGABRT rather than trying
to continue running other test cases. This doesn't work well with
seastar integration, the suite will hang.
Message-Id: <1516205469-16378-1-git-send-email-tgrabiec@scylladb.com>
Since we sometimes recommend that the user update to a newer kernel,
it's good to compile support for features that the new kernel supports.
Rather than play games with build-time dependencies, just #define
those features in. It's ugly, but better than depending on third-party
repositories and handling package conflicts.
Message-Id: <20180115143129.22190-1-avi@scylladb.com>
"The previous series handled a passing of the copy of the client_state from process_request(...)
to the process_request_one(...). However the modified copy of the client_state is returned by the
process_request_one(...) back to the process_request(...) and handling of this direction was missing
in the previous series.
This series completes the #2351 fix."
* 'fix-round-robin-cont-v2' of https://github.com/vladzcloudius/scylla:
transport::cql_server::process_request_one: return only the required information instead of the whole client_state object
service::client_state: move auth_state from cql_server::connection to service::client_state
transport::cql_server: don't cache sasl_challenge object in the cql_server::connection
service::client_state::merge(): remove not needed timestamp merge
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.
The fix is to always collect such zones.
Fixes#3129
Refs #3119
Refs #3120"
* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
lsa: Expose max_zone_segments for tests
lsa: Expose tracker::non_lsa_used_space()
lsa: Fix memory leak on zone reclaim
_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.
The fix is to always collect such zones.
Fixes#3129
Refs #3119
Refs #3120
The call chain is:
storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()
Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.
In fact, there is no need to wait for restore_replica_count to complete
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.
Tested with update_cluster_layout_tests.py
Fixes#2886
Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
Commit 2d5fb9d109 (gms/gossiper: Replicate changes incrementally to
other shards) changes the way we replicate _token_metadata and
endpoint_state_map. Before they are replicated at the same time, after
they are not any more. This causes a shard in NORMAL status can still be
with a empty _token_metadata.
We saw errors:
[shard 12] token_metadata - sorted_tokens is empty in first_token_index!
during CorruptThenRepairNemesis.
Fix by setting the gossip status to NORMAL after replication of
_token_metadata, so that once a node is in NORMAL, we can do repair. The
commit 69c81bcc87 (repair: Do not allow repair until node is in NORMAL
status) prevents the early repair operation by checking if a node is in
NORMAL status.
Fixes#3121
Message-Id: <af6a223733d2e11351f1fa35f59eacfa7d65dd30.1516065564.git.asias@scylladb.com>
that's required after fa5a26f12d on because sstable write fails when sharding
metadata is empty due to lack of keys that belong to current shard.
make_local_key* were moved to header to avoid compiling sstable_utils.cc into
all those tests that rely on simple_schema.hh, which is a lot.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180116052052.7819-1-raphaelsc@scylladb.com>
client_state used in the process_request_one(...) contains all sorts of information irrelevant
to the caller (process_request(...)), e.g. Tracing state. Therefore instead of returning
the whole client_state object (which becomes even a bigger problem if process_one(...) and process_request_one(...)
are executed on different shards) we will return only the pieces of information we really need.
To do that we introduce a new class - processing_result, which is cross-shard-access-ready to begin with.
We are going to return a instance of this new class from the process_request_one(...).
Fixes#2351
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Move the requests-handling-related state into the client_state. This is needed to properly
define the interface between the process_request(...) and process_request_one(...).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The benefit of such a caching is rather limited because it's likely to be used exactly once
and then destroyed anyway (in case of a successful authentication).
If the authentication has failed no harm is going to be done if we create this object again when
needed.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Since the connection::_client_state is the only generator of new timestamps
now there is no need for this merge.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"After this patchset it's only possible to create a mutation_source with a function that produces flat_mutation_reader."
* 'haaawk/mutation_source_v1' of ssh://github.com/scylladb/seastar-dev:
Merge flat_mutation_reader_mutation_source into mutation_source
Remove unused mutation_reader_mutation_source
Remove unused mutation_source constructor.
Migrate make_source to flat reader
Migrate run_conversion_to_mutation_reader_tests to flat reader
flat_mutation_reader_from_mutations: add support for slicing
Remove unused mutation_source constructor.
Migrate partition_counting_reader to flat reader
Migrate throttled_mutation_source to flat reader
Extract delegating_reader from make_delegating_reader
row_cache_test: call row_cache::make_flat_reader in mutation_sources
Remove unused friend declaration in flat_mutation_reader::impl
Migrate make_source_with to flat reader
Migrate make_empty_mutation_source to flat reader
Remove unused mutation_source constructor
Migrate test_multi_range_reader to flat reader
Remove unused mutation_source constructors
* seastar a7a3e6f...d03896d (11):
> Update dpdk submodule
> Merge "C++17 aligned allocations" from Avi
> Prometheus should check that the iterator is valid before using it
> future-util: failure to allocate internal state is unrecoverable
> Merge "Introduce simple microbenchmarking framework" from Paweł
> tutorial: document debuging ignored exceptions
> Revert "Merge "Introduce simple microbenchmarking framework" from Paweł"
> Merge "Introduce simple microbenchmarking framework" from Paweł
> tests/futures: add more tests for parallel_for_each()
> Add a prometheus.md file
> prometheus: Support metric family name parameter
"Added support for min/max functions over date/timestamp/timeuuid.
There was one issue with Scylla's type system internals: no C++ type
was mapped to these types. So special "native_types" were added for them.
It required some changes to native functions because these types don't support
the same operations as their real native counterparts.
Fixes #3104."
* 'danfiala/3104-v1' of https://github.com/hagrid-the-developer/scylla:
tests: Tests for min/max aggregate functions over date/timestamp and timeuuid.
functions: Added min/max functions for date/timestamp/timeuuid.
types: Added native types for timestamp and timeuuid.
Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported.
Fixes argument misquoting at $SRPM_OPTS expansion for the mock commands
and makes the --jobs argument work as supposed.
Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
Test fails after fa5a26f12d because generated sstable doesn't contain data for the
shard it was created at, so sharding metadata is empty, resulting in exception
added in the aforementioned commit. That's fixed by using the new make_local_key()
to generate data that belongs to current shard.
make_local_keys(), from which make_local_key() is built on top of, will be useful
to make sstable test work again with any smp count.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180114032025.26739-1-raphaelsc@scylladb.com>
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
* git@github.com:glommer/scylla.git timeouts-v8.1:
database: delete unused function
consolidate timeout_clock
mutation_query: add a timeout to the mutation query path
flat_mutation_reader: pass timeout down to consume()
add a timeout to fill_buffer
add a timeout to fast forward to
restricted_mutation_reader: don't pass timeouts through the config
structure
allow request-specific read timeouts in storage proxy reads
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
Implementation notes, and general comments about open discussion in 2462:
* Because we are not bypassing the timeout, just setting it high enough,
I consider the concerns about the batchlog moot: if we fail for any
other reason that will be propagated. Last case, because the timeout
is per-CF, we could do what we do for the dirty memory manager and
move the batchlog alone to use a different timeout setting.
* Storage proxy likes specifying its timeouts as a time_point, whereas
when we get low enough as to deal with the read_concurrency_config,
we are talking about deltas. So at some point we need to convert time_points
to durations. We do that in the database query functions.
v2:
- use per-request instead of per-table timeouts.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-timeout requests.
The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove.
The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the last patch, we enabled per-request timeouts, we enable timeouts
in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.
In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.
The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.
At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We pass the timeout that we received from data_query/mutation_query
down to consume, which is responsible for actually reading the data.
To make those timeouts actionable, though, we'll have to patch
fill_buffer(). This will happen in the next patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
data_query and mutation_query are patched so that they start accepting a
per-query timeout. We will default to no timeout, and then no callers
will be changed yet.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.
In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Scylla's configure.py calls seastar/configure.py and uses seastar.pc
that it produces to generate Scylla's build.ninja. However, there is no
appropriate dependency in build.ninja and changes to
seastar/configure.py alone do not trigger regeneration of Scylla's
build.ninja. This patch remedies that problem.
Message-Id: <20180111144237.5259-1-pdziepak@scylladb.com>
On Debian 9, 'pbuilder create' fails because of lack of GPG key for
3rdparty repo, so we need --allow-untrusted on 'pbuilder create' and
'pbuilder update'.
Also, apt-key adv --fetch-keys does not works correctly on it, but we can use
"curl <URL> | apt-key add -" as workaround.
Fixes#3088
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com>
If the compaction_backlog_manager's lifetime ends before the linked
compaction_backlog_tracker's, the latter's _manager pointer not being
cleared, can lead to a use-after-free error when running
~compaction_backlog_tracker(), as evidenced by unit-tests failed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-2-duarte@scylladb.com>
The legacy mutation_reader/streamed_mutation design allowed very easily
to skip the partition merging logic if there was only one underlying
reader that has emitted it.
That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.
The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".
perf_simple_query -c4 read results (medians of 60):
original regression
before 8731c1 after 8731c1 diff
read 326241.02 300244.09 -8.0%
this patch
before after diff
read 313882.59 325148.05 3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>
"This series revives the round-robin load balancing added by Pekka back in 2015.
If somebody tries to enable it with the current master it would quite quickly
lead to a crash due to a few unresolved issues in the corresponding code.
Fixes#2351Fixes#3118"
* 'fix-round-robin-balancing-v2' of github.com:vladzcloudius/scylla:
transport::server::process_request(): avoid extra copy of the client_state
service::cql_server::connection::process_request: use client_state "request copy" constructor
service::client_state: introduce "request copy" copy-constructor
service::storage_service: add the get_local_auth_service() accessor
service::client_state: remove the unused _tracing_session_id field
Don't use submit_to(...) when we are going to handle the request on a local
shard. Otherwise there is a not needed copy of the _client_state in the submit_to(...)
lambda capture list.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Create a cross-shard copy of the client_state object and give it to the single request handling
function and give it a timestamp generated by the original client_state instance (which is promised
to be monotonous).
Fixes#3118
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
A new constructor creates a copy of the current client_status to be
used in the context of the handling of a single request.
The copy may take place at a shard different from the one where the
request has been received.
In order to ensure the monotonicity of the timestamps used by the request handled
on the same connection the created copy of the client_state is going to use the same timestamp provided by the
caller instead of generating it.
It's the caller's responsibility to ensure the monotonicity of given timestamps.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
sprint() may need to allocate significant amount of memory if mutation
is large, and cause bad_alloc in
row_cache_test::test_concurrent_reads_and_eviction.
Message-Id: <1515486454-4913-1-git-send-email-tgrabiec@scylladb.com>
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when abort the session. Sending the
failed message in this case will send to a peer with uninitialized
dst_cpu_id which will casue the receiver to pass a bogus shard id to
smp::submit_to which cases segfault.
In addition, to be safe, initialize the dst_cpu_id to zero. So that
uninitialized session will send message to shard zero instead of random
bogus shard id.
Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test
Fixes#3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
After 611774b, we're blind again on which sstable caused a compaction
to fail, leaving us with cryptic message as follow:
compaction_manager - compaction failed: std::runtime_error (compressed
chunk failed checksum)
After this change, now both read failure in compaction or regular read
will report the guilty sstable, see:
compaction_manager - compaction failed: std::runtime_error (SSTable reader
found an exception when reading sstable ./data/.../keyspace1-standard1
ka-1-Data.db : std::runtime_error(compressed chunk failed checksum))
Fixes#3006.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>
"This patchset implements the compaction controller for I/O shares. The
goal is to automatic adjust compaction shares based on a
strategy-specific backlog. A higher backlog will translate into higher
shares.
As compaction progresses, that reduces the backlog. As new data is
flushed, that increases the backlog. The goal of the controler is to
keep the backlog constant at a certain rate, so that we don't go neither
too fast or too slow.
Tracking reads and writes:
==========================
Tracking of reads and writes happen through the read_monitor and the
write_monitor. The write monitor is an existing interface that has the
purpose of releasing the write permit at particular points of the write
process. We enhance it so to get a reference to an instance that tracks
the current offset inside the sstables::file_writer. This way the
backlog tracker can always know for sure what's the offset of the
current write.
A similar thing is done for reads. The data_consumer already tracks the
position of the current read, and we isolate that into a structure to
which we can get a reference. A read_monitor allows us to connect the
compaction to that reference.
Lifetime management:
====================
In general, tracking objects will be owned by their callers and passed
down as references. The compaction object will own the read monitors and
the compaction write monitors and the memtable flush write monitor will
be kept alive in a do_with block around the flush itself.
The backlog_{write,read}_progress_manager needs to be kept alive until
the SSTable is no longer in progress. For writes, that means until we
are able to add the SSTable charges in full, and for reads (compaction)
that means until we are able to remove the charges in full.
It is important to do that to avoid spikes in the graph. If we remove
the progress managers in a different operation than updating the SSTable
list we will be left in a temporary state where charges appear or
disappear abruptly, to be fixed when the final
add_sstable/remove_sstable happens. So we want those things to happen
together.
The compaction_backlog_tracker is kept alive until the strategy changes,
for example, through ALTER TABLE. Current charges are transferred to the
new strategy's compaction_backlog_tracker object when we do that. If the
type of strategy changes, the current read charges are forgotten. We can
do that because those running compaction will not really contribute to
decrease the backlog of the new compaction strategy.
Tranfer of Charges
==================
When ALTER TABLE happens, we need to transfer ongoing writes to the new
backlog manager. Ongoing reads will still be tracked by the
backlog_manager that originated them.
The rationale for that is that reads still belong to the current
compaction, with the strategy that generated them. But new Tables being
written will add to the backlog of the new strategy.
Note that ALTER TABLE operations not necessarily cause a change of
Strategy. We can be using the same strategy but just changing
properties. If that is the case, we expect no discontinuity in the
backlog graph (tested).
Resharding
==========
Resharding compactions are more complex than normal compactions because
the SSTables are created in one shard and later sent to another shard.
It is better, then, to track resharding compactions separately and let
them have their own backlog tracker, which will insert backlog in
proportion to the amount of data to be resharded.
Memtable Flush I/O Controller
=============================
With the current infrastructure it becomes trivial to add a new
controller, for either I/O or CPU. This patchset then adds an I/O
controller for memtable flushes, using the same backlog algorithm that
we already used for CPU."
* 'compaction-controller-io-v5' of github.com:glommer/scylla:
database: add a controller for I/O on memtable flushes.
document the compaction controller
compaction: adjust shares for compactions
backlog_controllers: implement generic I/O controller
factor out some of the controller code
io shares: multiply all shares by 10
compaction_strategy: implement backlog manager for the SizeTiered strategy
infrastructure for backlog estimator for compaction work.
sstables: notify about end of data component write
sstables: add read_monitor_generator
sstables: add read_monitor
sstables: enhance data consumer with a position tracker
sstables: enhance the file_writer with an offset tracker
sstables: pass references instead of pointers for write_monitor
compaction: control destruction of readers
'"The issue is triggered by compaction of sstables of level higher than 0.
The problem happens when interval map of partitioned sstable set stores
intervals such as follow:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]
When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusivess info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.
This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made."
Fixes #2908.'
* 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla:
tests: test for infinite recursion bug when doing high-level compaction
Fix potential infinite recursion when combining mutations for leveled compaction
dht: make it easier to create ring_position_view from token
dht: introduce is_min/max for ring_position
If a node shuts itself down due to I/O error (such as ENOSPC), then
nodetool status will show the cluster status at the time the shutdown
occured.
In fact the node will be in shutdown status (nodetool gossipinfo shows
the correct status), however, `nodetool status` does not interpret the
shutdown status, instead it use the output of:
curl -X GET --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/endpoint/live"
to decide if a node is in UN status.
To fix, do not include the node itself in the output of get_live_members
Without this patch, when a node is shutdown due to I/O error:
UN 127.0.0.1 296.2 MB 256 ? 056ff68e-615c-4412-8d35-a4626569b9fd rack1
With this patch, when a node is shutdown due to I/O error:
?N 127.0.0.1 296.2 MB 256 ? 056ff68e-615c-4412-8d35-a4626569b9fd rack1
Fixes#1629
Message-Id: <039196a478b5b1a8749b3fdaf7e16cfe2eb73a2f.1498528642.git.asias@scylladb.com>
The algorithm and principle of operation is the same as the CPU
controller. It is, however, always enabled and we will operate on
I/O shares.
I/O-bound workloads are expected to hit the maximum once virtual
dirty fills up and stay there while the load is steady.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Compactions can be a heavy disk user and the I/O scheduler can always
guarantee that it uses its fair share of disk.
Such fair share can, however, be a lot more than what compaction indeed
need. This patch draws on the controllers infrastructure to adjust the
I/O shares that the compaction class will get so that compaction
bandwidth is dynamically adjusted.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The control algorithm we are using for memtables have proven itself
quite successful. We will very likely use the same for other processes,
like compactions.
Make the code a bit more generic, so that a new controller has to only
set the desired parameters
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The issue is triggered by compaction of sstables of level higher than 0.
The problem happens when interval map of partitioned sstable set stores
intervals such as follow:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]
When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusivess info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.
This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made.
Fix is to use ring_position in reader_selector, such that
inclusiveness would be respected.
So reader_selector::has_new_readers() won't return false positive
under the conditions described above.
Fixes#2908.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes#3096
The credentials processing for transitional auth was broken
in ba6a41d, "auth: Switch to sharded service which effectively removed
the "virtualization" of underlying auth in the SASL challenge.
As a quick workaround, add the permissive exception handling to
sasl object as well.
Message-Id: <20180103102724.1083-1-calle@scylladb.com>
Returning token_endpoints when there are many tokens and end points can
take a long time.
This patch uses output stream to return the result.
Instead of returning a vector, it uses the streaming functionality in
json layer.
Fixes#2476
Message-Id: <20180103081907.5175-1-amnon@scylladb.com>
Technically all that matters is the proportion among the shares so this
change is functionally a noop. However, The CPU scheduler being proposed
has shares that go all the way up to 1000. In the hopes of being able to
unify I/O and CPU controllers one day, this patch brings the I/O shares
more in line with what Avi is doing for the CPU scheduler.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The SizeTiered backlog for a single SSTable is defined as:
Bi = Ei * log4(T / Si)
Where:
- Si is the size of this individual SSTable
- T is the sum of sizes for all individual SSTables
- Ei is the effective bytes in this SSTable.
The Effective size of an SSTable is:
- The uncompacted size for an SSTable under compaction
- The partially written size for an SSTable being written
- The SSTable size for an SSTable that is not undergoing
any of those processes.
The Aggregate Backlog for the entire Table is just the sum of
all individual SSTable backlogs, including the SSTables currently
being written.
Care is taken to avoid iterating over all SSTables, by separating
the aggregate backlog into a static component (sstables not changing) and
a component of SSTables that are undergoing change.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch adds infrastucture in various points in the system to allow
us to determine the amount of work present as backlog from compactions.
What needs to be done can be explained in three major pieces:
1) Add hooks in the points where sstables are added or inserted to a
column family (or more precisely, to a compaction_strategy object).
2) Add hooks in reads and write monitors that allows a compaction
backlog estimator (tracker) to become aware of bytes that are
partially written and compacted away.
3) Add a per-column family class (compaction_backlog_tracker) that
can be used to track work that is done and relevant to compactions
(like the two above), and a compaction manager to provide a
system-wide backlog based on the response of the individual trackers.
The definition of how much backlog one has is strategy-specific. The
Null strategy is easy, as it never really has any backlog, and so is the
major strategy - since what it really matters is the backlog of the
underlying compaction strategy.
Although backlogs are strategy-specific, they should be "compatible", in
the sense that if a particular strategy has more work to do, it should
yield a higher number than its counterparts.
All the others are presented in this patch as unimplemented: they will
always advertise a mild backlog that should yield a constant
CPU-utilization if used alone.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We need to notify the monitor that the offset tracker that we are using is
about to be destroyed and will no longer be valid.
While we could modify the file_writer interface so that we could capture
the offset_tracker and take ownership of it - guaranteeing it is alive
until we reach the existing on_write_completed(), this feels like a
layer violation.
It is also potentially useful in general to offer the monitor callers
with knowledge that writing the data portion is done.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Passing the read monitor down to the sstable readers is tricky. The
point of interest - like compaction - are usually very far from the
interfaces that register the monitor, like read_rows. Between the two,
there is usually a mutation_reader, which is and ought to be totally
unaware of the read monitor: technically, a mutation_reader may not even
know it is backed by sstables.
The solution is to create a read_monitor_generator, that can be passed
from the upper layers, like compaction, to the layers that are actually
making the decision of which sstables to create readers for.
Note that we don't need an equivalent piece of infrastructure for
writes, because writes don't happen through hidden layers and have all
the information they need to initialize their monitors.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Similar to the write_monitor, it will track progress of an sstable
being read. In the current interface, we will notify interested users
about what is the current position in the data file.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Callers, like compactions, will be able to know at any time the current
progress of a read.
As we do that, the currently unimplemented position() method of
data_consume_context becomes redundant and is removed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Callers, like the memtable flusher or compactions will be able to find
out the current amount of bytes written at any time.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This came from Avi's review on the read_monitors. He suggests we
wouldn't keep shared pointers, and would instead have the caller
ensuring lifetime. That makes sense, but having the writer interface
using shared_ptr and the read interface using references would lead to
an inconsistent interface.
For the sake of consistency we will change the write monitor to take
references before we do that. From database.cc's perspective, we could
now keep the monitors in a do_with() block, but we will keep the
shared_ptrs to manage their lifetime in anticipation of upcoming patches
in this series, where we'll have to pass them somewhere else.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Compactions run from a seastar::thread, in run(). They will either fail
or succeed, and from the point of view of ordering of destruction
between the compaction object and its readers:
- if compaction succeed, we have no control over who gets destructed
first since both objects will be going out of scope.
- if they fail, we will forceably destruct the compaction object, at
which point the readers are still alive
From the point of view of lifetime management, it would be nice to make
sure that the compaction object outlives whichever other objects it
needs during compaction.
This nice to have will become paramount when we start adding
read_monitors to the compaction object, that have to, themselves outlive
the readers.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstable), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.
One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.
There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with smaller positions which were emitted before, violating
monotonicity of fragment positions in the stream.
A similar bug was also present in partition_snapshot_flat_reader.
Possible solutions:
1) relax the assumption (in cache) that streams contain only relevant
range tombstones, and only require that they contain at least all
relevant tombstones
2) allow subsequent range tombstones in a stream to share the same
starting position (position is weakly monotonic), then we don't need
to de-overlap the tombstones in readers.
3) teach combining readers about query restrictions so that they can drop
fragments which fall outside the range
4) force leaf readers to trim all range tombstones to query restrictions
This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.
I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).
Solution 4 is confined to implementations of leaf readers, but also
has disadvantage of making those more complicated and slower.
There is only one consumer which needs the tombstones with monotonic positions, and
that is the sstable writer.
Fixes #3093."
* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev:
tests: row_cache: Introduce test for concurrent read, population and eviction
tests: sstables: Add test for writing combined stream with range tombstones at same position
tests: memtable: Test that combined mutation source is a mutation source
tests: memtable: Test that memtable with many versions is a mutation source
tests: mutation_source: Add test for stream invariants with overlapping tombstones
tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones
tests: mutation_reader: Test combined reader slicing on random mutations
tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys()
mutation_fragment: Introduce range()
clustering_interval_set: Introduce overlaps()
clustering_interval_set: Extract private make_interval()
mutation_reader: Allow range tombstones with same position in the fragment stream
sstables: Handle consecutive range_tombstone fragments with same position
tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
streamed_mutation: Introduce peek()
mutation_fragment: Extract mergeable_with()
mutation_reader: Move definition of combining mutation reader to source file
mutation_reader: Use make_combined_reader() to create combined reader
"delayed_tasks has a bug that if the object is destroyed while a timer
callback is queued, the callback will then try to access freed memory.
This series replaces the whole thing with sleep_abortable()."
* 'auth-delayed-tasks/v2' of https://github.com/duarten/scylla:
auth: Replace delayed_tasks with sleep_abortable
utils/exponential_backoff_retry: Add helper to automate retries
utils/exponential_backoff_retry: Add abort_source-based retry
Missed opportunity to feed shard id to sstable being written when
working on 67c5c8dc67, so when sstable is reopened after sealed,
its shard doesn't need to be recomputed by open procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171231024529.13664-1-raphaelsc@scylladb.com>
delayed_tasks has a bug that if the object is destroyed while a timer
callback is queued, the callback will then try to access freed memory.
This could be fixed by providing a stop() function that waits for
pending callbacks, but we can just replace the whole thing by levering
the abort_source-enabled exponential_backoff_retry.
This patch adds the do_until_value static member function to
exponential_backoff_retry, which retries the specified function until
it returns an engaged optional.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Scylla metadata could be empty due to bugs like the one introduced by
115ff10. Let's make shard computation resilient to empty sharding
metadata by falling back to the approach that uses first and last
keys to compute shards.
Refs #2932.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-2-raphaelsc@scylladb.com>
SSTable can generate an empty sharding metadata after a bug like
the one introduced here 115ff10, that results in tokens being
generated using base table for the view table. That leads to
sstable being deleted in subsequent boot because all shards will
agree on its deletion given that it will not belong to anybody,
and also compaction to crash because this relies on resulting
sstable belonging to one shard at least.
I wouldn't like to spend days debugging it again because sstable
write silently generated empty sharding metadata, so let's make
write fail when it happens (see issue #2932 for details).
Refs #2932.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-1-raphaelsc@scylladb.com>
Class optimized_optional was moved into seastar, and its usage
simplified so move_and_disengage() is replaced in favour of
std::exchange(_, { }).
* seastar adaca37...b0f5591 (9):
> Merge "core: Introduce cancellation mechanism" from Duarte
> Fix Seastar build that no longer builds with --enable-dpdk after the recent commit fd87ea2
> noncopyable_function: support function objects whose move constructors throw
> Adding new hardware options to new config format, using new config format for dpdk device
> Fix check for Boost version during pre-build configuration.
> variant_utils: add variant_visitor constructor for C++17 mode
> Merge "Allows json object to be stream to an" from Amnon
> Merge 'Default to C++17' from Avi
> Add const version of subscript operator to circular_buffer
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171228112126.18142-1-duarte@scylladb.com>
After materialized views has been implemented (although not enabled by
default), unimplemented::cause::VIEWS is no longer used. I think we can
drop it.
By the way, there are other no longer used unimplemented reasons, we
should probably drop them too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171224131318.4893-1-nyh@scylladb.com>
We provided "boost1.63" package for Debian 8 since we couldn't build
"scylla-boost163" package witch is available on Ubuntu14/16, but I fixed the
problem and now we have it for Debian 8 too, so switch to it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1514220163-25985-1-git-send-email-syuu@scylladb.com>
and make it a template to enable using it both with reference_wrapper
and flat_mutation_reader directly.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstable), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.
One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.
There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with smaller positions which were emitted before, violating
monotonicity of fragment positions in the stream.
A similar bug was also present in partition_snapshot_flat_reader.
Possible solutions:
1) relax the assumption (in cache) that streams contain only relevant
range tombstones, and only require that they contain at least all
relevant tombstones
2) allow subsequent range tombstones in a stream to share the same
starting position (position is weakly monotonic), then we don't need
to de-overlap the tombstones in readers.
3) teach combining readers about query restrictions so that they can drop
fragments which fall outside the range
4) force leaf readers to trim all range tombstones to query restrictions
This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.
I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).
Solution 4 is confined to implementations of leaf readers, but also
has disadvantage of making those more complicated and slower.
Fixes#3093.
"Remove old overloads that use mutation_reader."
* 'haaawk/combined_reader_clients_migration_v1_after_comments_2' of github.com:scylladb/seastar-dev:
Remove unused make_combined_reader overload.
Migrate test_fast_forwarding_combining_reader to flat reader
flat_mutation_reader_from_mutations: support partition_range
Don't pass fwd to flat_mutation_reader_from_mutations if it's no
Remove unused make_combined_reader overload.
Migrate test_combining_two_partially_overlapping_readers to flat reader
Migrate test_combining_two_non_overlapping_readers to flat reader
Migrate combined_mutation_reader_test to flat reader
Migrate test_sm_fast_forwarding_combining_reader to flat reader
Migrate test_combining_one_empty_reader to flat reader
Migrate test_combining_two_empty_readers to flat reader
Migrate test_combining_two_readers_with_one_reader_empty to flat reader
Migrate test_combining_one_reader_with_many_partitions to flat reader
Migrate test_combining_two_readers_with_the_same_row to flat reader
Migrate make_combined_mutation_source to flat reader
mutation_source: Add constructors for sources that ignore forwarding
This is needed to make it possible for
flat_mutation_reader_from_mutations to replace
make_reader_returning_many.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Default value for fwd is no so there's no need to pass it explicitly.
This is important because we will add additional parameter to
flat_mutation_reader_from_mutations in next patch.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Migrate all the places that used memtable::make_reader to use
memtable::make_flat_reader and remove memtable::make_reader.
* seastar-dev.git haaawk/remove_memtable_make_reader_v2_rebased:
Remove memtable::make_reader
Stop using memtable::make_reader in row_cache_stress_test
Stop using memtable::make_reader in row_cache_test
Stop using memtable::make_reader in mutation_test
Stop using memtable::make_reader in streamed_mutation_test
Stop using memtable::make_reader in memtable_snapshot_source.hh
Stop using memtable::make_reader in memtable::apply
Add consume_partitions(flat_mutation_reader& reader, Consumer consumer)
Add default parameter values in make_combined_reader
Migrate test_virtual_dirty_accounting_on_flush to flat reader
Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
Simplify flat_reader_assertions& produces(const mutation& m)
Migrate test_partition_version_consistency_after_lsa_compaction_happens
flat_mutation_reader: Allow setting buffer capacity
Add next_mutation() to flat_mutation_reader_assertions
cf::for_all_partitions::iteration_state: don't store schema_ptr
read_mutation_from_flat_mutation_reader: don't take schema_ptr
Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader
Needed in tests to limit amount of prefetching done by readers, so
that it's easier to test interleaving of various events.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The following patches contain fixes for skipping to the next parititon
in multi_range_reader and completelty dissable support for fast
forwarding inside a single partition, which is not needed and would only
add unnecessary complexity.
* https://github.com/pdziepak/scylla.git fix-multi_range_reader/v1:
flat_multi_range_mutation_reader: disallow
streamed_mutation::forwarding
flat_multi_range_mutation_reader: clear buffer on next_partition()
tests/flat_multi_range_mutation_reader: test skipping to next
partition
Using Materialized Views, if the base table has static columns,
and the update in base table mutates static and non static rows,
the streamed_mutation is stopped before process non static row.
The patch avoids stopping the stream_mutation and adds a test case.
Message-Id: <20171220173434.25091-1-tavares.george@gmail.com>
end_bound was not updated in one of the cases in which end and
end_kind was changed, as a result later merging decision using
end_bound were incorrect. end_bound was using the new key, but the old
end_kind.
Fixes#3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>
"4b9a34a85425d1279b471b2ff0b0f2462328929c "Merge sstable_data_source
into sstable_mutation_reader" has introduced unintentional changes, some
of them causing excessive read amplification during empty range reads.
The following patches restore the previous behaviour."
* tag 'fix-read-amplification/v1' of https://github.com/pdziepak/scylla:
sstables: set _read_enabled to false if possible
sstables: set _single_partition_read for single parititon reads
"Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstraping after loading node with
new sstables or repairing, or if lots of cfs are being actively written."
Fixes#1295.
* 'similar_sized_compaction_serialization_v4' of github.com:raphaelsc/scylla:
sstables: remove column_family from compaction_weight_registration
compaction_manager: serialize compaction of same size tier for different cfs
sstables: introduces deregister() and weight() to compaction_weight_registration
sstables: move compaction_weight_registration to its own header
sstables: improve compact_sstables() interface
compiler: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)
Problems introduced in f6a461c7a4
and 37b19ae6ba, respectively.
They both fail to compile due to use of method in lambda without
explicit mention of this. Some of failure is fixed by not using
auto in lambda parameter.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171218222144.12297-1-raphaelsc@scylladb.com>
* seastar 2b23547...adaca37 (7):
> Merge "Support for skipping over bytes from input stream in input_stream::consume" from Vladimir
> build: enforce Boost >= 1.58 during configuration.
> Tutorial: beginning of documentation of CPU scheduling et al.
> circular_buffer: make move-constructor noexcept
> circular_buffer: convert existing documentation to doxygen format
> build: fix detection of membarrier syscall support
> Merge "Improve systemwide_memory_barrier() on newer Linuces" from Avi
We had switched our own CentOS base image since we couldn't make built AMI to
public due to base image settings, it's probably because the image provided via
AWS market place.
However, I've found an official image outside of market place, and I succeeded
making built AMI to public based on the image.
URL: https://wiki.centos.org/Cloud/AWS
Once we could able to use official image, we probably should use official one.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513659164-28029-1-git-send-email-syuu@scylladb.com>
It creates a flat_mutation_reader from a reference to another reader.
This makes it easier to compose code in more elegant way.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Starting with commit fb0866ca20, tests
do not have to, and MUST NOT, define the disk error handlers. If they
do, we get a re-definition of variables already defined in
disk-error-handler.cc.
tests/hint_test.cc was apparently written before that commit, so we
need to remove the duplicate variables to get it to link.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218133635.20500-1-nyh@scylladb.com>
In commit 878d58d23a, a new parameter was
added to commitlog::descriptor. The commit message says that "It's default
value is a descriptor::FILENAME_PREFIX." while in reality, it did not have
a default value and compilation of tests/commitlog_test.cc broke, because
it didn't specify a value.
So this patch adds a default value for this parameter, as was suggested
by the original commit message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218131020.17883-1-nyh@scylladb.com>
In commit 1f4f71e619, an
stdx::optional<std::vector<sstring>> parameter was added to storage_proxy's
constructor. However, this parameter was not made optional, and
tests/cql_test_env.cc failed to compile because it didn't provide this
parameter.
This patch makes this parameter optional (if missing, it's like an empty
stdx::optional) so cql_test_env.cc compiles.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218132121.18782-1-nyh@scylladb.com>
and add read_context::enter_flat_partition. This will
temporarily coexist with read_context::enter_partition
but after everything in cache is migrated to flat reader
the new method will replace old one.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
called autoupdating_underlying_flat_reader. It will be modified
in the next patch to use flat reader to underlying.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstraping after loading node with
new sstables or repairing, or if lots of cfs are being actively written.
That being said, compaction jobs of same size tier are now serialized
on a given shard, such that maximum number of compaction (system wise)
is now:
(SHARDS) * (SIZE TIERS)
instead of:
(SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS)
We'll work hard to release a size tier (weight) for a column family
waiting on it as fast as possible, given that we wouldn't like to
underutilize resources available for compaction. We want one starting
after the other. Compaction for a column family that cannot run now
because the size tier is taken, will be postponed. There's a worker
that will be sleeping on a condition variable that will be signalled
whenever a compaction completes. FIFO ordering is used on postponed
list for fairness.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be needed for using it in compaction.hh. We can't declare
compaction_weight_registration in compaction_manager.hh, because
compaction.hh can't include the former due to cyclic dependency,
so compaction_weight_registration will be declared in its own
header.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that a new field in the descriptor will be forwarded
to compaction procedure without extending parameter list even more.
Also beautifies the interface, making it concise and easier to
play with.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The summary positions are defined to be in 'native' byte order.
Unfortunately this makes sharing files between big- and little-endian
machines much more difficult. For example, test files need to be
generated for both potential byte orders.
This change sets the byte order of the affected data to little-endian.
Ideally there would still be a way to deal with files generated on
big-endian systems using the 'native' byte order (see #3056).
Message-Id: <20171212183652.87881-1-mike.munday@ibm.com>
Don't enforce the outgoing connections from the 'listen_address'
interface only.
If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.
Fixes#3066
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
"Fixes #3057."
* 'summary_recreation_fixes_v2' of github.com:raphaelsc/scylla:
tests: sstable summary recreation sanity test
sstables: make loading of sstable without summary to work again
sstables: fix summary generation with dynamic index sampling
Currently scylla-housekeeping-daily.service/-restart.service hardcoded
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify CentOS .repo file,
but we use same .service for Ubuntu/Debian.
It doesn't work correctly, we need to specify .list file for Debian variants.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
"This series is the first part of hinted handoff implementation.
It includes:
- Minor adjustment of commitlog layer.
- Generation of hints when storage_proxy calls hint_to_dead_endpoints(...).
- Sending the hints to the Node that becomes UP.
It doesn't include:
- Node decommissioning.
- Resharding."
* 'hinted_handoff-v7-1' of github.com:vladzcloudius/scylla:
main + storage_service: wire up hints generation
config: add hints related options
db::hints::manager: initial commit
tracing: make the session state modifying methods and tracing::trace(...) noexcept
utils::fb_utilities: add is_me(addr) method
tests: hint_test: initial commit
db::commitlog::replay_position: added std::hash<replay_position>
db::commitlog: truncate segments to their actual sizes during shutdown
db::commitlog: allow defining a metrics category name
db/commitlog/commitlog::descriptor: add a filename_prefix parameter
db::commitlog::descriptor::descriptor(filename): pass a filename as a const ref
docs: hinted_handoff_design.md: high level design of a Hinted Handoff feature
- hints_directory:
- This option allows defining of the directory where hints files are going
to be stored if hinted handoff is enabled.
- hinted_handoff_enabled:
- May receive either a boolean value or a list of DCs. In the later case this
will define the DCs to which Nodes hints are going to be generated.
- max_hint_window_in_ms:
- Maximum amount of milliseconds the hints are going to be generated to the Node that is DOWN.
After this time period the hints are no longer going to be generated until the Node is seen UP.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Make state session creation, stop_forground() and tracing::trace(...) methods
noexcept.
Most of them have already been implemented the way that they won't throw
but this patch makes it official...
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Add a widely used method that returns TRUE if a given address is a broadcast
address of the local node.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Add a new field to db::commitlog::config that would define the metrics category name.
If not given - metrics are not going to be registered.
Set it to "commitlog" in db::commitlog::config(const db::config&).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This parameter is used when creating a new segment.
It's default value is a descriptor::FILENAME_PREFIX.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinted Handoff is a feature that allows replaying failed writes.
The mutation and the destination replica are saved in a log and replayed
later according to the feature configuration.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Boot failed when loading sstable with missing summary because a
internal procedure failed to take into account that a sstable
can have its summary recreated from index. Make it work again
by making that procedure aware of that.
Fixes#3057.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When recreating summary, data length was passed as data offset to
procedure that decides whether to sample or not. The problem is
that the procedure decides to sample index entry if data offset
is beyond a threshold. So the resulting summary will contain
only N sequential indexes entries starting from the first one,
which makes it quite inefficient. What should be done instead
is to pass position of current index entry, so summary content
will be as if it was created by a regular sstable write.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
stream_session.cc:417:62: error: cannot call member function ‘utils::UUID streaming::stream_session::plan_id()’ without object
sslog.warn("[Stream #{}] Failed to send: {}", plan_id(), ep);
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171214022621.19442-1-raphaelsc@scylladb.com>
There are some users used original default cluster_name 'Test Cluster',
they will fail to start the node for cluster_name change if they use
new scylla.yaml.
'ScyllaDB Cluster' isn't more beautiful than 'Test Cluster', reset back
to original old to avoid problem for users.
Fixes#3060
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <8c9dab8a64d0f4ab3a5d6910b87af696c60e5076.1513072453.git.amos@scylladb.com>
"While aa8c2cbc16 'Merge "Migrate sstables
to flat_mutation_reader" from Piotr' has converted the low-level sstable
reader to the new flat_mutation_reader interface there were still
multiple readers related to sstables that required converting,
including:
- restricted reader
- filtering reader
- single partition sstable reader
This series completes their conversion to the flat stream interface."
* tag 'flat_mutation_reader-sstable-readers/v2' of https://github.com/pdziepak/scylla:
db: convert single_key_sstalbe_reader to flat streams
db: fully convert incremental_reader_selector to flat readers
db: make make_range_sstable_reader() return flat reader
db: make column_family::make_reader() return flat reader
db: make column_family::make_sstable_reader() return a flat reader
filtering_reader: switch to flat mutation fragment streams
filtering_reader: pass a const dht::decorated_key& to the callback
mutation_reader: drop make_restricted_reader()
db: use make_restricted_flat_reader
mutation_reader: convert restricted reader to flat streams
Before flat mutation readers sstable::read_row() returned a
future<streamed_mutation>. That required a helper reader that would wait
for the streamed_mutations from all relevant sstables to be created and
then construct a mutation merger.
With flat mutation readers sstable::read_row_flat() returns a
flat_mutation_reader (no futures) so that the code can be simplified by
collecting all the relevant readers and creating a combined reader
without suspension points.
The unfortunate disadvantage of the flat_mutation_reader-based approach
is the fact that combined reader now needlessly compares the partition
keys even though we know that we read only a single partition, but
optimising that is out of scope of this patch.
All users of the filtering reader need only the decorated key of a
partition, but currently the predicate is given a reference to
streamed_mutations which are obsolete now.
In the case there are large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family will have 1/N, N is
the number of in-flight column families, memory for memtable. Large N
causes a lot of small sstables to be generated.
It is possible there are multiple senders to a single receiver, e.g.,
when a new node joins the cluster, the maximum in-flight column families
is number of peer node. The column families are sent in the order of
cf_id. It is not guaranteed that all peers has the same speed so they
are sending the same cf_id at the same time, though. We still have
chance some of the peers are sending the same cf_id.
Fixes#3065
Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.
We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same isse.
Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.
Fixes#3062
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
"This patch series refines the security model for the upcoming switch to
roles-based access control. Roles are still do not have any function,
but CQL statements related to roles manipulate metdata. The next major
patch series after this one will switch the system to roles.
Previously, most operations around roles required superuser, but this
violates an important idea in security called the "principal of least
privilege": that a user should have only the minimum access possible to
resources in order to achieve their objective.
To that end, this patch series introduces permissions on role resources.
For example, to grant a role to a user, the performing user must have
been granted AUTHORIZE on the role being granted.
In the table below, a user (role) that has been granted the permission
in the left-most column can perform the CQL query in the right columns
depending on if the permission has been granted to the root role
resource (all roles), or a particular role resource.
Perm. All roles Specific role (r)
---------------------------------------------------------
CREATE CREATE ROLE
ALTER ALTER ROLE * ALTER ROLE r
DROP DROP ROLE * DROP ROLE r
AUTHORIZE GRANT ROLE */REVOKE GRANT ROLE r/
ROLE * REVOKE ROLE r
DESCRIBE LIST ROLES
The following restrictions around superuser exist:
- CREATE ROLE: Only a superuser can create a superuser role.
- ALTER ROLE: Only a superuser can alter the superuser status of a role.
- ALTER ROLE: You cannot alter the superuser status of yourself or of a
role granted to you.
- DROP ROLE: Only a superuser can drop a role that has superuser.
The following additional "escape hatches" apply:
- ALTER ROLE: You can alter yourself (except to give yourself
superuser).
- LIST ROLES: You can list your own roles and list the roles of any role
granted to you.
Finally, a note on terminology: I like to say a role (or user) "is"
superuser if the role (user) has directly been marked as a superuser. A
role (user) "has" superuser if they have been granted a role that is a
superuser. The second statement encompasses the first, since a role can
always be said to have been granted to itself.
Fixes #2988."
* 'jhk/role_permissions/v2' of https://github.com/hakuch/scylla: (24 commits)
auth: Move permissions cache instance to service
auth: Add roles query function to service
cql3: Update access checks for `revoke_role_statement`
cql3: Update access checks in `grant_role_statement`
cql3: Update access checks in `list_roles_statement`
cql3: Update access checks in `drop_role_statement`
cql3: Update access checks in `alter_role_statement`
cql3: Update access checks in `create_role_statement`
tests: Switch to dedicated testing superuser
auth: Publicize enforcing check for service
tests: Expose client state from test env
Allow checking permissions from `client_state`
auth: Support querying for granted superuser
auth/service.hh: Document the class
cql3: Change `create_role_statement` base
cql3/Cql.g: Add role resources to grammar
cql3/Cql.g: Avoid extra copy of `auth::resource`
auth:resource.cc: Use `string_view` in reverse map
auth: Add `role` resource kind
auth: Add the DESCRIBE permission
...
Instead of a single sharded service shared all by all instances of
`auth::service`, it makes more sense for each instance of
`auth::service` to own its own instance of the permissions cache.
While it just calls into the underlying role manager, this level of
indirection allows us to add a roles cache in the future (which is
consistent with the behavior of Apache Cassandra).
A user with DESCRIBE on the root role resource can list any roles of any
roles, and also the roles in the system.
Otherwise, a user can list all the roles it has been granted and can
list all roles granted to those roles.
A role can be dropped if the performer has DROP permission on the role.
A role that has superuser (either directly or through another role
it has been granted) cannot be dropped except by a superuser.
Only superusers can alter superuser status, but only to roles not
granted to them. You can always alter your own role. You can alter
another role if you have ALTER permission on the role.
CREATE ROLE requires CREATE on <ALL ROLES>. Creating a superuser role
requires that the performer is a superuser.
This change also forms the beginning of a test suite for the CQL
interface to roles. We start with verifying access-control properties of
CREATE ROLE as written in this patch.
The auth service will eventually add the default
superuser ("cassandra"), but the current code does so after a delay.
Using a dedicated superuser for unit tests side-steps the issue and
allows the user to be created immediately.
Previously, this function was private and only `ensure_has_permission`
was public. `ensure_has_permission` throws in the absence of a
permission, but it can also be useful to query a permission without it
being an error.
This functionality is useful for implementing CQL statements and will
replace `auth::is_super_user` once roles have replaced users in Scylla.
Since eventually the auth service will have a roles cache, this function
is here rather than a part `role_manager`.
When a user is granted DESCRIBE on all roles (a resource kind that
doesn't exist yet in the code, but will exist soon), they gain the
ability to execute LIST ROLES queries.
Different kinds of resources support different permissions. For example,
a keyspace supports the CREATE permission, which allows a user to
create tables in that keyspace. However, a table does not have an
applicable CREATE permission.
If a non-applicable permission is requested, an
`invalid_request_exception` is thrown.
"Fixes #2866Fixes#2894
Changes gossip propagation to allow "atomic" grouping of values to ensure
their respective order.
Modifies gossip bootstrap startup to potentially wait longer in cases
where stabilization (messages done) takes time, to avoid data loss
in repair."
* 'calle/gossip' of github.com:scylladb/seastar-dev:
gossip: wait for stabilized gossip on bootstrap
gossiper: Prevent race condition in propagation
utils::to_string: Add printers for pairs+maps
utils::in: Add helper type for perfect forwarding initializer lists
This set is equal to `permissions::ALL`. When we switch over to
resource-specific permission sets, we will filter the set of all
permissions to only those that are applicable for the resource in
question.
Applicable permission sets will soon be specific to each kind of
resource. This change prepares us for dynamic querying of permission
sets by resource.
* seastar ac78eec...2b23547 (10):
> Merge "update shares for I/O classes" from Glauber
> Merge "Resumable tasks" from Avi
> input_stream: un-unroll input_stream::consume()
> net: adding yaml-based parser for network configuration supporting multiple interfaces
> scripts: perftune.py: don't attempt to set IRQs' affinity when IRQs list is empty
> tutorial: fix example code
> http: api_docs add swagger 2.0 support
> Support custom function for reading of config-files.
> Revert "provide an interface for updating the shares of an I/O class"
> provide an interface for updating the shares of an I/O class
The encoding logic was incorrect for big endian systems (shift needed
to be in the opposite direction). Rather than fix that issue I have
re-written the relevant code to restrict the storage format to little
endian byte order on all systems. My hope is that this will be a bit
easier to maintain.
Message-Id: <20171211124454.77488-1-mike.munday@ibm.com>
Most distros on s390x don't currently have gold installed by default.
Rather than disable gold on the platform add a check to see if gold
is installed and switch back to using the default system linker if it
isn't. The try_compile_and_link functionality is copied from the
seastar project.
Message-Id: <20171211122156.77385-1-mike.munday@ibm.com>
"The changes in this series fall into one of the following:
1) improve unit tests
2) improve code reuse in mvcc so that later cahnges will be easier
3) fix minor issues which were exposed by the above"
* tag 'tgrabiec/improve-and-fix-mvcc-tests-v4' of github.com:scylladb/seastar-dev:
tests: mvcc: Add more tests for consistency of continuity merging
tests: mvcc: Fix test_apply_is_atomic()
tests: mvcc: Do not assume that continuity of current row is updated on partition_snapshot_row_cursor::maybe_refresh()
mvcc: Reuse partition_snapshot_row_cursor in apply_to_incomplete()
mvcc: Propagate region reference to partition_entry::apply_to_incomplete()
mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_if_complete()
mvcc: partition_snapshot_row_cursor: Extract prepare_heap()
mvcc: Add const-qualified partition_version_ref::operator*()
tests: mvcc: Use mutation_partition_assertions
tests: Introduce mutation_partition_assertions
tests: Randomize static row continuity in random_mutation_generator
tests: mutation_assertion: Introduce is_continuous()
mvcc: Introduce partition_snapshot_row_cursor::read_partition()
mutation_partition: Introduce deletable_row::apply() from a clustering_row fragment
mutation_partition: Extract sliced() from mutation into mutation_partition
mvcc: Introduce partition_snapshot::static_row_continuous()
mvcc: Introduce partition_snapshot::range_tombstones() for full range
mvcc: Don't require external schema in parition_snapshot::range_tombstones()
mutation_partition: Define equal_continuity() using get_continuity()
mutation_partition: Make check_continuity() const-qualified
mutation_partition: Make check_continuity() public
mutation_partition: Introduce mutation_partition::get_continuity()
Introduce clustering_interval_set
mutation_partition: Leave moved-from row in an empty state
mutation_partition: Fix upgrade() not preserving static row continuity
"This series includes a few patches from Michael Munday <mike.munday@ibm.com> (Z-project)
and a few from me. The most significant is PATCH10 that introduces a vectorized version
of CRC32 calculation (based on the Anton Blanchard's work)."
* 'scylla-power64-port-v2-1' of https://github.com/vladzcloudius/scylla:
test.py: limit the tests to run on 2 shards with 4GB of memory
tests: sstable_datafile_test: fix the compilation error on Power
tests: compound_test: fix the 'narrowing' compilation error on Power
cql3::constants::literal: fix the empty string parser
utils::crc32: add power64 crc32 HW accelerated implementation
repair: use seastar::cache_line_size for aligning to the cache line size
build: add -lcryptopp to libs
utils/allocation_strategy: force alignment to be at least sizeof(void*)
utils::crc: introduce process_le/be(T) methods
utils/crc: use zlib for crc32 on non-x86 platforms
main: only perform SSE 4.2 check on x86-family CPUs
configure.py: don't use 'gold' linker on Power
configure.pu: add --target flag to override -march value
'char' and int8_t ('unsigned char') are different types. 'bytes' base type
is int8_t - use the correct type for casting.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
'bytes' has int8_t as a base type and 0xff value is out of this type's range.
Use the corresponding signed value instead.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Don't assume the 'char' being signed - this is implementation dependent.
Compare to '\xFF' value which is the actual intent.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Use seastar::cache_line_size for cache line alignment instead of a hard coded value (64) - this value is
not always correct, e.g. PPC64 platform, where cache line size is 128B.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
It currently is updated only when iterators are invalidated. Better
to not assume that, because it's not really needed, and
maintaining this would complicate maybe_refresh() after continuity
merging rules change later.
The alignment of packed structs can be 1. The system¹ posix_memalign
function will return EINVAL when passed this alignment. This fix
forces the alignment to be at least sizeof(void*).
¹ The seastar implementation of posix_memalign does not appear to
have this limitation currently.
Replace the oblique process(T) overloads for integer types with
explicit process_le/be(T) methods that would interpret the given integer
as a stream of bytes using the corresponding endiannes.
For instance
process_le(0x11223344) would treat this integer as the following array of bytes:
{0x44, 0x33, 0x22, 0x11}.
process_be(0x11223344) on the other hand would treat this integer as if it's
{0x11, 0x22, 0x33, 0x44}.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This is probably the simplest way to make the build work on other
architectures. --target can be set to an empty string to allow
the compiler's default to be used.
If --target is not set then the default is going to be 'nehalem' on
x86 machines and the compiler's default on all other platforms.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Signed-off-by: Michael Munday <mike.munday@ibm.com>
This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
Will make it easy to represent and manipulate continuity in tests.
Could also replace clustering_row_ranges in the future, which is
currently a naked vector<> with no semantic methods.
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.
Introduced in 2.0
Fixes #3053."
* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add reproducer for issue #3053
tests: mvcc: Add test for partition_snapshot::range_tombstones()
mvcc: Optimize partition_snapshot::range_tombstones() for single version case
mvcc: Fix partition_snapshot::range_tombstones()
tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
The issue is that partition_snapshot::range_tombstones() is
deoverlapping tombstones coming from different versions, and it may
happen that due to range tombstone splitting that function will return
a tombstone which starts after the requested range. This breaks
assumptions made by the cache reader. It keeps track of the maximum
fragment position, and if cache reader will then need to read from
sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.
partition_snapshot::range_tombstones() is deoverlapping tombstones
coming from different versions and it may happen that due to range
tombstone splitting the method will return a tombstone which starts
after the requested range. This would cause it to return a tombstone
which doesn't overlap with the requested range.
This breaks assumptions made by cache reader. It keeps track of the
maximum fragment position, and if cache reader will then need to read
from sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.
Exposed by a change in row_cache_test.cc::test_mvcc() which fills the
buffer of sm5 reader after it is created.
Fixes#3053.
It is assumed that dummy entries are only at !is_clustering_row() positions.
Causes cache_streamed_mutation to assert when trying to trim a range tombstone.
"Didn't affect any release. Regression introduced in 301358e.
Fixes#3041"
* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
tests: add sstable resharding test to test.py
tests: fix sstable resharding test
sstables: Fix resharding by not filtering out mutation that belongs to other shard
db: introduce make_range_sstable_reader
rename make_range_sstable_reader to make_local_shard_sstable_reader
db: extract sstable reader creation from incremental_reader_selector
db: reuse make_range_sstable_reader in make_sstable_reader
"In time-series, it's common for tables in a given time window to be eventually
fully expired. The deletion of such tables is done by compaction, but there's
*no* need to *actually* compact such fully expired sstables *iff* their full
deletion will not cause older data to be ressurected. In other words, a fully
expired table can be actually skipped (but deleted in the end) by compaction
*iff* it doesn't contain newer data than its overlapping counterparts. So there
may be false negatives, but never false positives.
All that said, the goal behind this patchset is to save read bandwidth of disk
in such scenarios. Given that fully expired sstables will not be read by
compaction process anymore, read amplification will be greatly reduced too.
Fixes #2620."
* 'time_series_performance_improvement_v2_2' of github.com:raphaelsc/scylla:
tests: check sstable auto correct bad max deletion time
tests: add test for compaction with fully expired table
sstables/compaction: do not actually compact fully expired sstables
sstables: make sstable auto correct max_local_deletion_time
sstables: switch to const ref wherever possible
sstables: use gc_clock::time_point for gc_before
gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
sstables: introduce sstable::get_max_local_deletion_time
sstables: remove unnecessary copy in time series strategies
sstables: change return value type of get_fully_expired_sstables
dtcs: make code to extract non expired tables faster
sstables: add has_correct_max_deletion_time to sstable
"Soon we will have resources beyond just keyspaces and table names. There
will be resources for roles, for user-defined functions (UDFs), and
possible resources for REST end-points. This change generalizes the
implementation of a `data_resource` to many different kinds of
resources, though there is still only one kind (`data`).
The most important patch is 2/5 ("auth/resource: Generalize to different
kinds"), which re-writes `auth::data_resource`. The patch message should
sufficiently explain the design decisions involved.
The other patches rename files and identifiers based on the expanded
role of this class, except for 5/5 ("auth/resource.hh: Rename
`resource_ids`"): this patch gives a more appropriate name to a type
alias.
Fixes #3027."
* 'jhk/generalize_resource/v3' of https://github.com/hakuch/scylla:
auth/resource.hh: Rename `resource_ids`
auth: Rename `data_resource` files
cql3/authorization_statement: Fix typo
auth/resource: Generalize to different kinds
auth: Rename `data_resource` to `resource`
wrong sstable was used when checking for content, and storage service
for test was missing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After 301358e, sstable resharding stopped work because shared sstables would
use a filtering reader, which excludes mutation that belong to other shards.
That completely breaks which relies on compaction of mutations that belong
to different shards. The fix is about using recently introduced non local
shard reader.
Fixes#3041.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
introduce reader variant that will allow its caller to read a range
in a given table without any filter applied.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There's no need to actually compact a sstable which is fully expired
and which deletion of all its data will not ressurect older data.
For that, a sstable will only be considered fully expired if it
doesn't contain data newer than its overlapping counterparts.
That way, there could be a false negative, but never a false positive.
Currently, a fully expired sstable would unnecessarily waste read
bandwidth of disk. This will help a lot time series workloads in
which data for a given time window is all deleted at once using TTL.
Fixes#2620.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstables created prior to cc6c383 can contain bad max deletion time stat,
which would make get_fully_expired_sstables return sstables that aren't
actually fully expired. Let's make sstable invalidate the stat if it
is potentially incorrect.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
unordered_set will allow us to quickly extract fully expired tables
from a set of compacting sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
since it's O(n) and not O(n log n).
change also needed for change in interface of function to retrieve
fully expired tables, or sort lambda would need to be parametrized.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit cc6c38324 fixes the stat. It was only updated for range
tombstone prior to fix, so a sstable that had a regular cell with
no expiration time could be considered fully expired which can
lead to bad decisions in compaction for time series workloads.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This change generalizes the implementation of a `resource` to many
different kinds of resources, though there is still only one
kind (`data`). In the future, we also expect resource kinds for roles,
user-defined functions (UDFs), and possibly on particular REST
end-points.
I considered several approaches to generalizing to different kinds of
resources.
One approach is to have a base class that is inherited from by different
resource kinds. The common functionality would be accessed through
virtual member functions and kind-specific functions would exist in
sub-classes. I rejected this approach because dealing with different
kinds of resources uniformly requires storage and life-time management
through something like `std::unique_ptr<auth::resource>`, which means
that we lose value semantics (including comparison) and must deal with
complications around ownership.
Another option was to use `boost::variant` (or, in future,
`std::variant`). This is closer to what we want, since there a static
set of resource kinds that we support. I rejected this approach for two
reasons. The first is that all resource kinds share the same data (a
list of segments and a root identifier), which would be duplicated in
each type that composed the variant. The second is that the complexity
and source-code overhead of `boost::variant` didn't seem warranted.
The solution I ended up with is home-grown variant. All resources are
described in the same `final` class: `auth::resource`. This class has
value semantics, supports equality comparison, and has a strict
ordering. All resources have in common a tag ("kind") and a list of
parts. Most operations on resources don't care about the kind of
resource (like getting its name, parsing a name, querying for the
parent, etc). These are just member functions of the class.
When we care about a kind-specific interpretation of a resource, we can
produce a "view" of the resource. For example, `data_resource_view`
allows for accessing the (optional) keyspace and table names.
I anticipate in the future to add functions for creating role
resources (`auth::resource::role`) and also `role_resource_view`.
The functional behaviour of the system should be unchanged with this
patch.
I've added new unit tests in `auth_resource_test.cc` and removed the old
test from `auth_test.cc`.
Fixes#3027.
"This is CASSANDRA-7886 and CASSANDRA-8592. The patch series detects
that CL of a request can no longer be reached due to errors and fails
the request earlier. New type of errors are reported: read/write failure
which were introduced in cql v4 protocol. For compatibility if older
protocol is used the error is translated to timeout error."
* 'gleb/request-failure_v2' of github.com:scylladb/seastar-dev:
storage_proxy: fail read/write requests early if it cannot be completed due to errors
storage_service: add WRITE_FAILURE_REPLY_FEATURE feature
gossiper: add node_has_feature() function
cql: add read/write failure exceptions
storage_proxy: fix data presence reporting in read timeout error during
storage_proxy: remove inheritance from enable_shared_from_this for abstract_write_response_handler
storage_proxy: remove unneeded field in abstract_write_response_handler
storage_proxy: fix pending endpoint accounting for EACH_QUORUM
consistency_level: constify quorum_for() and local_quorum_for()
When fast forwarding is enabled and all readers positioned inside the
current partition return EOS, return EOS from the combined-reader
too. Instead of skipping to the next partition if there are idle readers
(positioned at some later partition) available. This will cause rows to
be skipped in some cases.
The fix is to distinguish EOS'd readers that are only halted (waiting
for a fast-forward) from thoose really out of data. To achieve this we
track the last fragment-kind the reader emitted. If that was a
partition-end then the reader is out of data, otherwise it might emit
more fragments after a fast-forward. Without this additional information
it is impossible to determine why a reader reached EOS and the code
later may make the wrong decision about whether the combined-reader as
a whole is at EOS or not.
Also when fast-forwarding between partition-ranges or calling
next_partition() we set the last fragment-kind of forwarded readers
because they should emit a partition-start, otherwise they are out of
data.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <6f0b21b1ec62e1197de6b46510d5508cdb4a6977.1512569218.git.bdenes@scylladb.com>
Since pbuilder chroot environment does not install CA certificates by default,
accessing https://download.opensuse.org will cause certificate verification
error.
So we need to install it before installing 3rdparty repo GPG key.
Also, checking existance of gpgkeys_curl is not needed, since it's always
not installed since we are running the script in clean chroot environment.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512517001-27524-1-git-send-email-syuu@scylladb.com>
"This fix for the issue #2989 first adds unit tests for caching_options which
is the only class that uses the helpers from json.hh. This is done to
have regression tests in place for the main change.
The second commit adds conditional use of new recommended JsonCpp API
where available. For older versions of the library, it uses the old
code."
* 'issues/2989/v1' of https://github.com/argenet/scylla:
Use CharReaderBuilder/CharReader and StreamWriterBuilder from JsonCpp.
tests: Add unit tests for caching_options.
"This series makes sstable tests use flat stream interface. The main
motivation is to allow eventual removal of mutation_reader and
streamed_mutation and ensuring that the conversion between the
interfaces doesn't hide any bugs that would be otherwise found."
* tag 'flat_mutation_reader-sstable-tests/v1' of https://github.com/pdziepak/scylla:
sstables: drop read_range_rows()
tests/mutation_reader: stop using read_range_rows()
incremental_reader_selector: do not use read_range_rows()
tests/sstable: stop using read_range_rows()
sstables: drop read_row()
tests/sstables: use read_row_flat() instead of read_row()
database: use read_row_flat() instead of read_row()
tests/sstable_mutation_test: get flat_mutation_readers from mutation sources
tests/sstables: make sstable_reader return flat_mutation_reader
sstable: drop read_row() overload accepting sstable::key
tests/sstable: stop using read_row() with sstable::key
tests/flat_mutation_reader_assertions: add has_monotonic_positions()
tests/flat_mutation_reader_assertions: add produces(Range)
tests/flat_mutation_reader_assertions: add produces(mutation)
tests/flat_mutation_reader_assertions: add produces(dht::decorated_key)
tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind)
tests/flat_mutation_reader_assertions: fix fast forwarding
The assertions already have produces(mutation) and
produces(dht::decorated_key) overloads. Additional overload that accepts
a range of elements will allow to check if a range of mutations of
decorated keys is produced.
The same interface is exposed by mutation_reader_assertions.
Fixes#2866
Instead of a raw 30s sleep waiting for gossip to stabilize/set up
ranges on bootstrap, use similar logic as 'wait_for_gossip_to_settle'
and loop said 30s or more until we neither grow/shrink ep set, or
are processing ACK:s.
Fixes#2894
Allow applying certain application states as monotonic sets,
i.e. allow set of states as input, and ensure the values are
re-versioned and all applied together.
Then do so for certain states that are by design coupled
(status/tokens).
Similar solution as origins, as issue is copy of the same.
produces(mutation_fragment::kind) is provided by
streamed_mutation_assertions and is going to be needed in order to
fully convert tests to the flat mutation readers.
Both fast_forward_to() overloads return a future which should be waited
for. Additionally, fast_forward_to(const dht::partition_range&) expects
the range to remain valid at least until the next call to
fast_forward_to(). The original mutation_reader_assertions guaranteed
that and so should flat_mutation_reader_assertions.
* seastar dc44656...ac78eec (3):
> json formatter: Add unsigned support to the json formatter
> Add missing usual smart-pointer methods to foreign_ptr
> future-util: remove use of forward references in some primitives
Recently, memtable flush in test requires storage service for tests,
or it fails with "Assertion `local_is_initialized()' failed".
storage_service_for_tests needs to run in a thread, that's why
flush_memtable was flattened.
Last but not least, we need to revert flushed memory account because
same memtable is used for all sstables in the perf test so as not
to trigger `_mt._flushed_memory <= _mt.occupancy().used_space()'
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171205012853.21559-1-raphaelsc@scylladb.com>
In version 1.8.3 of JsonCpp shipped with Fedora 27, old FastWriter and
Reader classes from JsonCpp have been deprecated in favour of
newer/better ones: CharReaderBuilder/CharReader and
StreamWriterBuilder/StreamWriter.
This fix uses the new classes where available or resorts to old ones for
older versions of the library.
Fixes#2989
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"Convert combined_mutation_reader into a flat_mutation_reader impl. For
now - in the name of incremental progress - all consumers are updated to
use the combined reader through the
mutation_reader_from_flat_mutation_reader adaptor. The combined reader also
uses all it's sub mutation_readers through the
flat_mutation_reader_from_mutation_reader adaptor."
* 'bdenes/flatten-combined-reader-v8' of https://github.com/denesb/scylla:
Add unit tests for the combined reader - selector interactions
Add flat_mutation_reader overload of make_combined_reader
Flatten the implementation of combined_mutation_reader
Add mutation_fragment_merger
mutation_fragment::apply(): handle partition start and end too
Add non-const overload of partition_start::partition_tombstone()
Make combined_mutation_reader a flat_mutation_reader
Move the mutation merging logic to combined_mutation_reader
Remove the unnecessary indirection of mutation_reader_merger::next()
Move the implementation of combined_mutation_reader into mutation_reader_merger
Remove unused mutation_and_reader::less_compare and operator<
Our external repos are already signed repo, so let's enable secure-apt.
Seems like more recent version of Ubuntu (tested on 18.04) does not accept
skipping GPG check, so we need it anyway in near future.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Now we can cross build our .rpm/.deb packages, so let's extend AMI build script
to support cross build, too.
Also Ubuntu 16.04 support added, since it's latest Ubuntu LTS release.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1510247204-2899-1-git-send-email-syuu@scylladb.com>
There are a few edge cases that were untested and as this patch-series
reworks completely how the combined-reader works these should be tested
as well to ensure they keep working.
This is the mutation fragment level equivalent of mutation_merger.
It merges fragments produced by different sources. Mutation
fragments are not as self-contained as streamed mutations, they have
external context, e.g. the partition they belong to. To support this
mutation_fragment_merger operates on a producer instead of a vector of
fragments. Producer can have internal state and can do side-actions as
fragments are consumed.
For now only the interface is converted, behind the scenes the previous
implementation remains, it's output is simply converted by
flat_mutation_reader_from_mutation_reader. The implementation will be
converted in the following patches.
This simple code-movement and patch lays the groundwork for splitting
the logic in combined_mutation_reader into two blocks:
* one that takes care of moving the readers in lockstep and emits their
output as a non-decreasing stream of streamed_mutations and
* one that takes care of merging the above stream into
strictly-increasing stream of streamed_mutations.
This in turn is preparation-work to the transformation of
combined_mutation_reader into a flat_mutation_reader::impl.
"A delayed task can fail to execute, for example if the consistency
level the task required can't be achieves, so we should ensure it is
retried.
Fixes#3038"
* 'auth-retry/v2' of https://github.com/duarten/scylla:
auth/standard_role_manager: Extend exception handling
auth/common: Add exception handling and retry to task scheduling
auth/standard_role_manager: Lift async block to caller
Also handle exceptions thrown by has_existing_roles(), and print a
similar message to Apache Cassandra in case of error.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This follows the implementation in Apache Cassandra. The auth tasks
executed by delay_until_system_ready() usually perform a query with
QUORUM consistency level, which can fail if some nodes are
unavailable. So, we provide both exception handling and a retry
mechanism.
Fixes#3038
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
has_existing_roles() creates a seastar thread, but that can be
lifted to the caller for prettier code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We are getting package build error on dh_auto_install which is invoked by
pybuild.
But since we handle all installation on debian/scylla-server.install, we can
simply skip running dh_auto_install.
Fixes#3036
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512065117-15708-1-git-send-email-syuu@scylladb.com>
"This series makes it easier to comprehend assertion failures which
involve printing mutation contents."
* 'tgrabiec/mutation-printout' of github.com:scylladb/seastar-dev:
tests: Introduce mutation_diff script
mutation: Make printout more concise
mutation_partition: Don't print absent elements
mutation_partition: Make row_marker printout similar to other partition elements
database: Move operator<<() overloads to appropriate source files
mutation_partition: Use multi-line printout
position_in_partition: Improve printout
Convert to a multi line output, which is easier to read for a human.
After:
{ks.cf key {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} data {mutation_partition: {tombstone: none},
range_tombstones: {},
static: cont=1 {row: },
clustered: {
{rows_entry: cont=true dummy=false {position: clustered,ckp{000c636b30303030303030303030},0} {deletable_row: {row: }}},
{rows_entry: cont=true dummy=true {position: clustered,ckp{000c636b30303030303030303031},0} {deletable_row: {row: }}}}}}
Before:
{position: type clustered, bound_weight -1, key ckp{000c636b30303030303030303033}}
After:
{position: clustered,ckp{000c636b30303030303030303033},-1}
Benefits:
- most significant parts appear first.
bound_weight, which is least significant, was in the middle before.
- shorter, so a bit easier to parse assertion failures.
The universal reference was introduced so we could bind an rvalue to
the argument, but it would have sufficed to make the argument a const
reference. This is also more consistent with the function's other
overload.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171129132758.19654-1-duarte@scylladb.com>
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.
* git@github.com:duarten/scylla.git rt-compact-fixes/v1:
compound_compact: Allow rvalues in size()
sstables/sstables: Convert non-compound clustering element to compound
tests/sstable_mutation_test: Verify we can write/read non-correct RTs
service/storage_service: Export non-compound RT feature
Add test to verify we can write and read non-compound tombstones and
compound ones for backward compatibility.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
576ea421dc introduced a regression
as it didn't change the assumption that all clustering elements where
compound when writing a range tombstone, compound or non-compound, as
compound. Thus, we serialized a non-compound element while we should
have serialized a compound one.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
after 7f8b62bc0b, its move operator and ctor broke. That potentially
leads to error because data_consume_context dtor moves sstable ref
to continuation when waiting for in-flight reads from input stream.
Otherwise, sstable can be destroyed meanwhile and file descriptor
would be invalid, leading to EBADF.
Fixes#3020.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171129014917.11841-1-raphaelsc@scylladb.com>
"This simplifies implementation of mutation_partition merging by relaxing
exception guarantees it needs to provide. This allows reverters to be dropped.
Direct motivation for this is to make it easier to implement new semantics
for merging of clustering range continuity.
Implementation details:
We only need strong exception guarantees when applying to the memtable, which is
using MVCC. Instead of calling apply() with strong exception guarantees on the latest
version, we will move the incoming mutation to a new partition_version and then
use monotonic apply() to merge them. If that merging fails, we attach the version with
the remainder, which cannot fail. This way apply() always succeeds if the allocation
of partition_version object succeeds.
Results of `perf_simple_query_g -c1 -m1G --write` (high overwrite rate):
Before:
101011.13 tps
102498.07 tps
103174.68 tps
102879.55 tps
103524.48 tps
102794.56 tps
103565.11 tps
103018.51 tps
103494.37 tps
102375.81 tps
103361.65 tps
After:
101785.37 tps
101366.19 tps
103532.26 tps
100834.83 tps
100552.11 tps
100891.31 tps
101752.06 tps
101532.00 tps
100612.06 tps
102750.62 tps
100889.16 tps
Fixes #2012."
* tag 'tgrabiec/drop-reversible-apply-v1' of github.com:scylladb/seastar-dev:
mutation_partition: Drop apply_reversibly()
mutation_partition: Relax exception guarantees of apply()
mutation_partition: Introduce apply_weak()
tests: mvcc: Add test for atomicity of partition_entry::apply()
tests: Move failure_injecting_allocation_strategy to a header
tests: mutation_partition: Test exception guarantees of apply_monotonically()
mvcc: Use apply_monotonically() where sufficient
mvcc: partition_version: Use apply_monotonically() to provide atomicity
mvcc: Extract partition_entry::add_version()
mutation_partition: Introduce apply_monotonically()
mutation_partition: Introduce row::consume_with()
The uses which needed strong or weak exception guarantees were
switched to a solution involving apply_monotonically(). All remaining
uses don't need any exception guarantees.
This patch drops the use of apply_reversibly(). We move the mutation
to be applied into a new version and then use apply_monotonically() to
merge it (if no snapshot) with the current version. This guarantees
that apply() is atomic even if apply_monotonically() throws.
Fixes#2012.
Has weaker exception guarantees than apply(), which allows for simpler
implementation. Intended to replace the apply() with strong exception
guarantees.
This series converts memtable flush reader to the new flat mutation
readers. Just like the scanning reader, flush reader concatenates
multiple partition snapshot readers in order to provide a stream
of all partitions in the memtable.
* https://github.com/pdziepak/scylla.git flat_mutation_reader-memtable-flush/v1
tests/flat_mutation_reader_assertion: add produces_partition()
memtable: make make_flush_reader() return flat_mutation_reader
flat_mutation_reader: add optimised flat_mutation_reader_opt
memtable: switch flush reader implementation to flat streams
tests/memtable: add test for flush reader
"This series adds the role-management interface, the primary implementation, and the corresponding CQL.
Importantly, this series does not integrate the system with roles, nor does it remove user-based access control. Several new CQL statements are available and should function, but these modify metadata only and have no functional impact on the actual
+system.
The new statements are:
- CREATE ROLE
- ALTER ROLE
- DROP ROLE
- GRANT ROLE
- REVOKE ROLE
- LIST ROLES
The security model of the role manager is simple at this point: only superusers can create and drop roles. The next patch series will introduce fine-grained role permissions and also slightly change the CQL syntax to more consistent with the
+rest of the grammar. This patch series is a starting point for evolving the roles feature and integrating it.
Fixes #2987."
* 'jhk/role_management/v5' of https://github.com/hakuch/scylla:
auth: Add `alter_role_statement`
auth: Add `create_role_statement`
auth: Add `drop_role_statement`
auth: Add 'revoke_role_statement'
auth: Add `grant_role_statement`
auth: Add `list_roles_statement`
auth: Add dormant role manager to `service`
auth/service.cc: Remove redundant declarations
cql3: Add `role_name` and parser rules
auth: Add role manager
auth: Unconditionally create the `system_auth` keyspace
unimplemented.hh: Use [[noreturn]] instead of GCC attribute
New `unimplemented` feature: roles
Unlike Apache Cassandra, the role manager does not write data related to
password authentication in the metadata tables, and the rest of the
system does not yet integration with the role manager.
Therefore, executing `CREATE ROLE` currently ignores all
authentication-related options (`PASSWORD` and `OPTIONS`).
Dropping a role removes all references to it from other roles.
As with the role-management statements, executing this statement updates
metadata but has not functional impact yet.
While granting a role updates the necessary metadata, since roles do not
interact with the rest of the system yet, there is not functional impact
of doing so.
The role manager still does not interact with the rest of the system,
but the role manager is now sharded on all cores and metadata is
created.
The following metadata tables are created:
- `system_auth.roles`
- `system_auth.role_members`
The default superuser, "cassandra", is also created, but has no function.
The `userOrRoleName` parser rule is important for future CQL
role-related statements.
`cql3::role_name` is a small utility for role-related CQL statements
that enforce an important property of role names: that they are always
lower-case unless quoted appropriately.
The role manager is responsible for creating, removing, querying for,
granting, and revoking roles.
The role manager does not yet run in production, and is not connected to
the rest of the system.
Included in this patch is the definition of the abstract role management
interface, and also the implementation of the standard role manager.
The standard role manager is tested fully in the `role_manager_test`.
The `system_auth` keyspace is used to store tables for authentication
and authorization metadata.
Previously, this keyspace would only be created if either of the
non-default authenticator or authorizer were activated in configuration.
The upcoming role-management system is enabled unconditionally and also
uses the `system_auth` keyspace for its metadata.
These tests now require having the storage service initialize, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
* seastar 7f87529...3b09bad (7):
> Extend Travis CI to cover Clang 5.0 builds.
> fair_queue: disallow zeroed shares.
> Multiple fixes to io_tester to make it compile with GCC 5:
> transformers: Create tuple explicitely for older compiler support
> core/sstring: Add construction from `string_view`
> io_tester: enhanced fair queue tester
> fstream: do not ignore dma_write return value
The following patches convert sstable writers to use flat mutation
readers instead of the legacy mutation_reader interface.
Writers were already using flat consumer interface and used
consume_flattened_in_thread(), so most of the work was limited to
providing an appropriate equivalent for flat mutation readers.
* https://github.com/pdziepak/scylla.git flat_mutation_reader-sstable-write/v1:
flat_mutation_reader: move consumer_adapter out of consume()
flat_mutation_reader: introduce consume_in_thread()
tests/flat_mutation_reader: test consume_in_thread()
sstables: switch write_components() to flat_mutation_reader
streamed_mutation: drop streamed_mutation_returning()
sstables: convert compaction to flat_mutation_reader
mutation_reader: drop consume_flattened_in_thread()
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.
We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.
Fixes#2995Fixes#2986Fixes#2979Fixes#2992Fixes#2993
* git@github.com:duarten/scylla.git promoted-index-serialization/v3:
sstables/sstables: Unify column name writers
sstables/sstables: Don't write index entry for a missing row maker
sstables/sstables: Reuse write_range_tombstone() for row tombstones
sstables/sstables: Lift index writing for row tombstones
sstables/sstables: Leverage index code upon range tombstone consume
sstables/sstables: Move out tombstone check in write_range_tombstone()
sstables/sstables: A schema with static columns is always compound
sstables/sstables: Lift column name writing logic
sstables/sstables: Use schema-aware write_column_name() for
collections
sstables/sstables: Use schema-aware write_column_name() for row marker
sstables/sstables: Use schema-aware write_column_name() for static row
sstables/sstables: Writing promoted index entry leverages
column_name_writer
sstables/sstables: Add supported feature list to sstables
sstables/sstables: Don't use incorrectly serialized promoted index
cql3/single_column_primary_key_restrictions: Implement is_inclusive()
cql3/delete_statement: Constrain range deletions for non-compound
schemas
tests/cql_query_test: Verify range deletion constraints
sstables/sstables: Correctly deserialize range tombstones
service/storage_service: Add feature for correct non-compound RTs
tests/sstable_*: Start the storage service for some cases
sstables/sstable_writer: Prepare to control range tombstone
serialization
sstables/sstables: Correctly serialize range tombstones
tests/sstable_assertions: Fix monotonicity check for promoted indexes
tests/sstable_assertions: Assert a promoted index is empty
tests/sstable_mutation_test: Verify promoted index serializes
correctly
tests/sstable_mutation_test: Verify promoted index repeats tombstones
tests/sstable_mutation_test: Ensure range tombstone serializes
correctly
tests/sstable_datafile_test: Add test for incorrect promoted index
tests/sstable_datafile_test: Verify reading of incorrect range
tombstones
sstables/sstable: Rename schema-oblivious write_column_name() function
sstables/sstables: No promoted index without clustering keys
tests/sstable_mutation_test: Verify promoted index is not generated
sstables/sstables: Optimize column name writing and indexing
compound_compat: Don't assume compoundness
TTL of 1 second may cause the cell to expire right after we write it,
if the second component of current time changes right after it. Use
larger ttl to avoid spurious faliures due to this.
Message-Id: <1511463392-1451-1-git-send-email-tgrabiec@scylladb.com>
This patch changes some factory functions so that they don't assume
the schema is compound.
This enables some code simplification in
sstables::write_column_name().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of serializing the column name twice, serialize it once into a
buffer which gets used for index bookkeeping and to write to disk.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
flat_mutation_reader provides a replacement for the old
consume_flattened*() interface and therefore an 'in-thread' variant is
also necessary. It expects to be executed in a seastar::thread context
and guarantees that the consumer member functions will be invoked inside
that thread as well (which is why it cannot be easily replaced by
non-thread version).
Addition to that, just like the old consume_flattened_in_thread() its
replacement allows specifying a filter functions that causes selected
partitions to be skipped entirely and never reach the consumer.
This function is now called write_compound_non_dense_column_name() so
callers are aware of the cases where it call be called.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Add a test to verify that we can still read incorrectly written range
tombstones for non-compound schemas, for previous Scylla versions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we correctly serialize range tombstones for dense
non-compound schemas, which until now assumed the bounds were compound
composite. We also fix the reading function, which assumed the same
thing. This affected Apache Cassandra compatibility.
Fixes#2986
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support to sstable_writer to be able to control
correct range tombstone serialization.
When range tombstone serialization will be fixed in subsequent
patches, it will only be enabled when the whole cluster supports the
feature to allow for rollbacks.
The feature needs to be enabled for an sstable as a whole, to prevent
problems with it being enabled during an sstable write.
Thus, the sstable writer will pass on this information to the sstable
methods that carry out the actual file writing.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a cluster feature to enable correct serialization of
non-compound range tombstones. We thus support rollbacks during an
upgrade, as we will only change range tombstone serialization when the
cluster is fully upgraded and all nodes are capable of reading the new
format.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the range tombstone read path to deal with
correctly written non-compound range tombstones, while also
maintaining backward compatibility and reading old Scylla-generated
range tombstones.
The fix for the write path will activate an sstable feature which will
connect with this patch.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We cannot represent ranged deletions with non-inclusive bounds on our
current storage format for schemas that are non-compound, since the
clustering key won't include the EOC byte.
Refs #2986
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Promoted indexes generated before this patch by Scylla are considered
incorrect if they belong to a non-compound schema, due to #2993.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds additional metadata to the scylla sstable component.
Namely, it adds a list of features that the current sstable supports.
The upcoming usages of the feature list are meant for backward
compatibility, but the implementation makes no such assumptions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch refactors writing a promoted index entry to leverage the
column_name_writer. It not only reduces code duplication, but also
solves two important bugs:
1) Column names for schema types other than compound non-dense were
not correctly serialized, as the wrong overload of
write_column_name() was being called, which assumed the specified
composite to be compound.
2) Before, for some schema types we were passing an empty
clustering_key to maybe_flush_pi_block(), which caused it to bypass
appending open range tombstones to the data file, causing wrong
query results to be returned.
Fixes#2979Fixes#2992Fixes#2993
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch lifts the logic to write a column name depending on the
schema's denseness and compoundness into a function, so that it may
later be reused in other places. We still duplicate the same logic
when writing a clustered row because the index writer requires it for
now.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A schema can only have static columns if it has at least one
clustering column. A schema with a clustering column is always
compound, unless it is created with compact storage. A schema created
with compact storage cannot have static columns, so we can remove dead
code from the sstable write path.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Encapsulate the decision to write the row_marker and to write a
corresponding entry in the promoted index. We now avoid writing the
index entry if there is no row marker, and just start indexing the row
at the first cell.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Making consumer_adapter a member of flat_mutation_reader::impl instead
of being a local class in consume() will make it possible to reuse that
helper in other functions.
It's hard to make sense of the metric transport.requests_blocked_memory
because it shows a queue size. Specially in production setups scraping
at every 15 seconds, that doesn't tell us much.
We solve that in other layers that record blocking by providing both a
requests_blocked_memory and requests_blocked_memory_current
Fixes#3010
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171123033329.32596-1-glauber@scylladb.com>
Prometheus histograms have 3 embedded metrics: count, buckets, and sum.
Currently we fill up count and buckets but sum is left at 0. This is
particularly bad, since according to the prometheus documentation, the
best way to calculate histogram averages is to write:
rate(metric_sum[5m]) / rate(metric_count[5m])
One way of keeping track of the sum is adding the value we sampled,
every time we sample. However, the interface for the estimated histogram
has a method that allows to add a metric while allowing to adjust the
count for missing metrics (add_nano())
That makes acumulating a sum inaccurate--as we will have no values for
the points that were added. To overcome that, when we call add_nano(),
we pretend we are introducing new_count - _count metrics, all with the
same value.
Long term, doing away with sampling may help us provide more accurate
results.
After this patch, we are able to correctly calculate latency averages
through the data exported in prometheus.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171122144558.7575-1-glauber@scylladb.com>
* seastar-dev.git haaawk/flat_reader_remove_read_rows:
sstable_mutation_test: use read_rows_flat instead of read_rows
perf_sstable: use read_rows_flat instead of read_rows
Remove sstable::read_rows
Introduce sstable::read_row_flat and sstable::read_range_rows_flat methods
and use them in sstable::as_mutation_source.
* https://github.com/scylladb/seastar-dev/tree/haaawk/flat_reader_sstables_v3:
Introduce conversion from flat_mutation_reader to streamed_mutation
Add sstables::read_rows_flat and sstables::read_range_rows_flat
Turn sstable_mutation_reader into a flat_mutation_reader
sstable: add getter for filter_tracker
Move mp_row_consumer methods implementations to the bottom
Remove unused sstable_mutation_reader constructor
Replace "sm" with "partition" in get_next_sm and on_sm_finished
Move advance_to_upper_bound above sstable_mutation_reader
Store sstable_mutation_reader pointer in mp_row_consumer
Stop using streamed_mutation in consumer and reader
Stop using streamed_mutation in sstable_data_source
Delete sstable_streamed_mutation
Introduce sstable::read_row_flat
Migrate sstable::as_mutation_source to flat_mutation_reader
Remove single_partition_reader_adaptor
Merge data_consume_context::impl into data_consume_context
Create data_consume_context_opt.
Merge on_partition_finished into mark_partition_finished
Check _partition_finished instead of _current_partition_key
Merge sstable_data_source into sstable_mutation_reader
Remove sstable_data_source
Remove get_next_partition and partition_header
to check whether partition is finished. In next patch
_current_partition_key will be merged with sstable_data_source::_key
and won't be cleared any more.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will be used in sstable_mutation_reader before
first fill_buffer is called and a proper data_consume_context
is created.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Since we want to support cross building, we shouldn't hardcode GPG file path,
even these files provided on recent version of mock.
This fixes build error on some older build environment such as CentOS-7.2.
Fixes#3002
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511277722-22917-1-git-send-email-syuu@scylladb.com>
These patches convert queries (data, mutation and counter) to flat
mutation readers. All of them already use consume_flattened() to
consume a flat stream of data, so the only major missing thing
was adding support for reversed partitions to
flat_mutation_reader::consume().
* pdziepak flat_mutation_reader-queries/v3-rebased:
flat_mutation_reader: keep reference to decorated key valid
flat_muation_reader: support consuming reversed partitions
tests/flat_mutation_reader: add test for
flat_mutation_reader::consume()
mutation_partition: convert queries to flat_mutation_readers
tests/row_cache_stress_test: do not use consume_flattened()
mutation_reader: drop consume_flattened()
streamed_mutation: drop reverse_streamed_mutation()
Some queries may need the fragments that belong to partition to be
emitted in the reversed order. Current support for that is very limited
(see #1413), but should work reasonably well for small partitions.
consume_flattened() guarantees that partition key (passed by reference)
will be valid until the end of partition.
flat_mutation_reader::consume() provides the same interface for consumer
so it also should make sure that the key remains valid.
For a time range tombstone that was already removed from a tree
is owned by a raw pointer. This doesn't end well if creation of
a mutation fragment or a call to push_mutation_fragment() throw.
Message-Id: <20171121105749.16559-1-pdziepak@scylladb.com>
Don't std::move() the "query" string inside the parallel_for_each() lambda.
parallel_for_each is going to invoke the given callback object for each element of the range
and as a result the first call of lambda that std::move()s the "query" is going to destroy it for
all other calls.
Fixes#2998
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1511225744-1159-1-git-send-email-vladz@scylladb.com>
This will be used together with sstables::read_range_rows
to migrate sstables::as_mutation_source().
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Don't use streamed_mutation in mp_row_consumer
and sstable_mutation_reader.
Also use sstable_mutation_reader in sstable::read_row.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
* seastar 78cd87f...7f87529 (3):
> exception: use phdr hash on reactor threads only
> tests: httpd use noncopyable_function
> Merge "fixes of issues found by seastar's unit tests" (ppc) from Vlad
Fixes#2967.
Currently flat_mutation_reader_from_mutation_reader()'s
converting_reader will throw std::runtime_error if fast_forward_to() is
called when its internal streamed_mutation_opt is disengaged. This can
create problems if this reader is a sub-reader of a combined reader as the
latter has no way to determine the source of a sub-reader EOS. A reader
can be in EOS either because it reached the end of the current
position_range or because it doesn't have any more data.
To avoid this, instead of throwing we just silently ignore the fact that
the streamed_mutation_opt is disengaged and set _end_of_stream to true
which is still correct.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <83d309b225950bdbbd931f1c5e7fb91c9929ba1c.1511180262.git.bdenes@scylladb.com>
The exception handling code inspects server state, which could be
destroyed before the handle_exception() task runs since it runs after
exiting the gate. Move the exception handling inside the gate and
avoid scheduling another accept if the server has been stopped.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171116122921.21273-1-duarte@scylladb.com>
We get following warning from antlr3 header when we compile Scylla with gcc-7.2:
/opt/scylladb/include/antlr3bitset.inl: In member function 'antlr3::BitsetList<AllocatorType>::BitsetType* antlr3::BitsetList<AllocatorType>::bitsetLoad() [with ImplTraits = antlr3::TraitsBase<antlr3::CustomTraitsBase>]':
/opt/scylladb/include/antlr3bitset.inl:54:2: error: nonnull argument 'this' compared to NULL [-Werror=nonnull-compare]
To make it compilable we need to specify '-Wno-nonnull-compare' on cflags.
Message-Id: <1510952411-20722-2-git-send-email-syuu@scylladb.com>
Switch Debian 3rdparty packages to our OBS repo
(https://build.opensuse.org/project/subprojects/home:scylladb).
We don't use 3rdparty packages on dist/debian/dep, so dropped them.
Also we switch Debian to gcc-7.2/boost-1.63 on same time.
Due to packaging issues following packages doesn't renamed our 3rdparty
package naming rule for now:
- gcc-7: renamed as 'xxx-scylla72', instead of scylla-xxx-72.
- boost1.63: doesn't renamed, also doesn't changed prefix to /opt/scylladb
Message-Id: <1510952411-20722-1-git-send-email-syuu@scylladb.com>
This series reworks handling of range tombstones in reversed queries
so that they are applied to correct rows. Additionally, the concept
of flipped range tombstones is removed, since it only made it harder
to reason about the code.
Fixes#2982.
* https://github.com/pdziepak/scylla fix-reverse-query-range-tombstone/v2:
streamed_mutation: fix reversing range tombstones
range_tombstone: drop flip()
tests/cql_query_test: test range tombstones and reverse queries
tests/range_tombstone_list: add test for range_tombstone_accumulator
_current_pi_idx was not reset from advance_to_next_partition(), which
is used when we skip to the next partition before fully consuming
it. As a result, if we try to skip to a clustering position which is
before the index block used by the last skip in the previous
partition, we would not skip assuming that the new position is in the
current block. This may result in more data being read from the
sstable than necessary.
Fixes#2984
Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>
It will be used in sstable_mutation_reader when the reader
will be used to implement sstable::read_row.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Streamed mutation won't be used any more so get_next_partition
and on_partition_finished are more suitable names.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Those methods have to be below sstable_mutation_reader because
they will be using the reader instead of streamed_mutation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This is the first step which still uses streamed_mutation.
Next step will be to get rid of streamed_mutation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Right now reversed streamed mutation emits range tombstones after the
mutation fragments affected by them. This breakes the queries.
This patch reworks the way range tombstones are handled in reversed
streams:
- range tombstones are no longer flipped -- invariant that start bound
is smaller than the end bound always holds
- in reversed streams they are ordered by their end_position()
Fixes#2982.
* seastar 11ad0b1...78cd87f (3):
> Merge "http: Use output stream for files" from Amnon
> tutorial: a section about when_all() and when_all_succeed()
> Merge "Power8 related changes (what's left of them)" from Vlad
As can be seen in one of the traces in #2958, the copy constructor
of index_entry is called in response to std::vector<index_entry>::push_back(index_entry&&).
This is wasteful. Fix by providing the full suite of constructors/assignment operators.
Message-Id: <20171116121608.5580-1-avi@scylladb.com>
"This patch series addresses #2929. The objective is to eliminate global
state from the implementation and use of all access-control functionlity.
I've made every effort to make these patches logically independent and
incremental, but the final patch is big: this was necessary because
eliminating the global instances themselves is an atomic change."
* 'jhk/non_global_auth/v2' of https://github.com/hakuch/scylla:
auth: Switch to sharded service
tracing/trace_keyspace_helper: Use internal `client_state`
auth: Make the QP an explicit dependency
auth: Unify Java class name attributes
auth: Make life-time control more consistent
auth: Move metadata constants
auth: Don't expose internal constant
auth: Extract `permissions_cache`
utils/loading_cache: Include necessary dependency
auth: Fix static constant initialization
auth: Extract `delayed_tasks` from `auth.cc`
This change appears quite large, but is logically fairly simple.
Previously, the `auth` module was structured around global state in a
number of ways:
- There existed global instances for the authenticator and the
authorizer, which were accessed pervasively throughout the system
through `auth::authenticator::get()` and `auth::authorizer::get()`,
respectively. These instances needed to be initialized before they
could be used with `auth::authenticator::setup(sstring type_name)`
and `auth::authorizer::setup(sstring type_name)`.
- The implementation of the `auth::auth` functions and the authenticator
and authorizer depended on resources accessed globally through
`cql3::get_local_query_processor()` and
`service::get_local_migration_manager()`.
- CQL statements would check for access and manage users through static
functions in `auth::auth`. These functions would access the global
authenticator and authorizer instances and depended on the necessary
systems being started before they were used.
This change eliminates global state from all of these.
The specific changes are:
- Move out `allow_all_authenticator` and `allow_all_authorizer` into
their own files so that they're constructed like any other
authenticator or authorizer.
- Delete `auth.hh` and `auth.cc`. Constants and helper functions useful
for implementing functionality in the `auth` module have moved to
`common.hh`.
- Remove silent global dependency in
`auth::authenticated_user::is_super()` on the auth* service in favour
of a new function `auth::is_super_user()` with an explicit auth*
service argument.
- Remove global authenticator and authorizer instances, as well as the
`setup()` functions.
- Expose dependency on the auth* service in
`auth::authorizer::authorize()` and `auth::authorizer::list()`, which
is necessary to check for superuser status.
- Add an explicit `service::migration_manager` argument to the
authenticators and authorizers so they can announce metadata tables.
- The permissions cache now requires an auth* service reference instead
of just an authorizer since authorizing also requires this.
- The permissions cache configuration can now easily be created from the
DB configuration.
- Move the static functions in `auth::auth` to the new `auth::service`.
Where possible, previously static resources like the `delayed_tasks`
are now members.
- Validating `cql3::user_options` requires an authenticator, which was
previously accessed globally.
- Instances of the auth* service are accessed through `external`
instances of `client_state` instead of globally. This includes several
CQL statements including `alter_user_statement`,
`create_user_statement`, `drop_user_statement`, `grant_statement`,
`list_permissions_statement`, `permissions_altering_statement`, and
`revoke_statement`. For `internal` `client_state`, this is `nullptr`.
- Since the `cql_server` is responsible for instantiating connections
and each connection gets a new `client_state`, the `cql_server` is
instantiated with a reference to the auth* service.
- Similarly, the Thrift server is now also instantiated with a reference
to the auth* service.
- Since the storage service is responsible for instantiating and
starting the sharded servers, it is instantiated with the sharded
auth* service which it threads through. All relevant factory functions
have been updated.
- The storage service is still responsible for starting the auth*
service it has been provided, and shutting it down.
- The `cql_test_env` is now instantiated with an instance of the auth*
service, and can be accessed through a member function.
- All unit tests have been updated and pass.
Fixes#2929.
Rather than have all uses of the QP in auth reference global variables,
we supply a QP reference to both the authenticator and authorizer on
construction.
The caller still references a global variable when constructing the
instances, but fixing this problem is a much larger task that is out of
scope of this change.
This change is motivated partly be aesthetics, but more significantly
due to the future work to refactor `auth` into a sharded service. Since
doing so will require writing `auth::auth` from scratch, these
constants (and other common functionality) need a new home.
Using "Meyer's singletons" eliminate the problem of static constant
initialization order because static variables inside functions are
initialized only the first time control flow passes over their
declaration.
Fixes#2966.
This simple task scheduler is used by the auth module to delay metadata
creation until the system is settled.
Extracting it out allows the `auth` module to be refactored into a
sharded service and for other components of `auth` to make use of it.
Fixes#2965.
This patchset prepares sstables read path for flat_mutation_reader.
It cuts some dependencies between classes and replaces
sstables::mutation_reader with ::mutation_reader. This will make it
possible to gradually convert the code to flat_mutation_reader because
we have converters between flat_mutation_reader and ::mutation_reader.
* seastar-dev.git haaawk/flat_reader_prepare_sstables_rebased
Reduce dependencies from mp_row_consumer to sstable_streamed_mutation
Replace sstables::mutation_reader with ::mutation_reader
Remove range_reader_adaptor
Remove sstable_range_wrapping_reader
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will make migration to flat_mutation_reader much
easier and sstables::mutation_reader is going away with
this migration anyway.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Before this patch mp_row_consumer was using sstable_streamed_mutation
in two ways:
1. Populate sstable_streamed_mutation's buffer with mutation_fragments
2. Advance sstable_streamed_mutation's sstable_data_source to new position.
We can easily reduce those dependencies only to the first one.
This will reduce the coupling between those classes and simplify
the flow of execution.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"Otherwise, such strategies couldn't behave as expected when it needs to do STCS."
* 'respecting_stcs_options_v2' of github.com:raphaelsc/scylla:
tests: enable twcs test that relied on size-tiered properties
twcs: respect stcs options by forwarding them to stcs method
lcs: forward stcs options to respect them
stcs: make most_interesting_bucket respect size-tiered options
stcs: make most_interesting_bucket respect thresholds
compaction: make size_tiered_most_interesting_bucket static method of stcs class
stcs: introduce new ctor
stcs: make header self contained
stcs: inline function definition so as not to break one definition rule
Broken in f570e41d18.
Not replicating this may cause coordinator to treat a node which is
down as alive, or vice verse.
Fixes regression in dtest:
consistency_test.py:TestAvailability.test_simple_strategy
which was expected to get "unavailable" exception but it was getting a
timeout.
Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
Make sure loading_cache::stop() is always called where appropriate:
regardless whether the test failed or there was an exception during the test.
Otherwise a false-alarm use-after-free error may occur.
Fixes#2955
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1510625736-3109-1-git-send-email-vladz@scylladb.com>
"Fixes #2944."
* tag 'tgrabiec/cache-exception-safety-fixes-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add test for exception safety of multi-partition scans
tests: row_cache: Add test for exception safety of single-partition reads
tests: mutation_source_tests: Always print the seed
tests: Disable alloc failure injection in test assertions
tests: Avoid needless copies
row_cache: Fix exception safety of cache_entry::read()
row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data
row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh()
cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation
mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors
cache_streamed_mutation: Make advancing to the next range exception-safe
cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe
cache_streamed_mutation: Make drain_tombstones() exception-safe
cache_streamed_mutation: Return void from start_reading_from_underlying()
cache_streamed_mutation: Document invariants related to exception-safety
streamed_mutation: Add reserve_one()
lsa: Guarantee invalidated references on allocating section retry
mvcc: partition_snapshot_row_cursor: Mark allocation points
BOOST_TEST_MESSAGE() is not logged by default, and for some tests we
don't want to enable that because it's too noisy. But we need to know
the seed to reproduce a failure, so we better to always print it.
If assignment to _lower_bound in the "_secondary_in_progress = true;"
case in do_read_from_primary() throws due to allocation failure, the
update section will be retried and we will take the not_moved path,
skipping the range which was discontinuous and was supposed to be read
from underlying.
Fix by redoing lookup using _lower_bound in case the section is
retried. When we retry, _primary.valid() will be false. We need to
ensure now that _lower_bound is always valid.
Fixes#2944.
Changing _ck_ranges_curr and _lower_bound should be atomic, either
both fail or both succeed. Currently it could happen that if
position_in_partition::for_range_start() fails, _ck_ranges_curr would
be advanced but _lower_bound not.
We need to maintain the following invariants:
(1) no fragment with position >= _lower_bound was pushed yet
(2) If _lower_bound > mf.position(), mf was emitted
Before this patch (1) could be violated if drain_tombstones() failed
in the middle. (2) could be violated if push_mutation_fragment()
failed.
There is existing code (e.g. use of partition_snapshot_row_cursor in
cache_streamed_mutation) which assumes that references will be
invalidated when bad_alloc is thrown from allocating_section. That is
currently the case because on retry we will attempt memory reclamation
which will invalidate references either through compaction or
eviction. Make this guarantee explicit.
thrift/server.cc:237:6: required from here
thrift/server.cc:236:9: error: cannot call member function ‘void thrift_server::maybe_retry_accept(int, bool, std::__exception_ptr::exception_ptr)’ without object
maybe_retry_accept(which, keepalive, std::move(ex));
gcc version: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113184537.10472-1-raphaelsc@scylladb.com>
"The following patches convert streaming and repair code to the new
flat mutation reader interface. In particular this involves changing::
- fragment_and_freeze() -- a consumer that fragments and freezes mutations
- checksum computation for repair which until now was using two-level
mutation_reader/streamed_mutation interface
- multi_range_reader -- a mutation reader that automatically fast
forwards to between given partiton ranges"
* tag 'flat_mutation_reader-streaming/v2' of https://github.com/pdziepak/scylla: (24 commits)
mutation_reader: drop multi_range_reader
db: convert make_streaming_reader() to flat_mutation_reader
tests/flat_mutation_reader: add test for multi range reader
tests/flat_mutation_reader_assertions: add fast_forward_to()
tests/simple_schema: add to_ring_positions() helper
flat_mutation_reader: convert flat_multi_range_mutation_reader
flat_mutation_reader: add partition_range_forwarding
flat_mutation_reader: make pop_mutation_fragment() public
flat_mutation_reader: copy multi_range_mutation_reader
streamed_mutation: drop mutation_hasher
tests/flat_mutation_reader: add test for partition checksum
repair: convert partition_checksum::compute_streamed() to flat streams
repair: make partition_hasher consume flat mutation streams
mutation_hasher: copy mutation_hasher to repair.cc
partition_start: make partition_tombstone() const
partition_checksum: introduce compute() for flat_mutation_reader
db: drop single-range make_streaming_reader()
fragment_and_freeze: drop streamed_mutation overload
stream_transfer_task: switch to flat_mutation_reader
tests/flat_mutation_reader: add test for fragment_and_freeze
...
flat_mutation_reader::partition_range_forwarding and
mutation_reader::forwarding are aliases of the same type. The change was
necessary in order to make mutation_reader::forwarding available in
flat_mutation_reader.hh even though it is included by mutation_reader.hh
flat_mutation_reader public interface already exposes low leve
is_buffer_empty() and is_buffer_full() adding pop_mutation_fragment()
will make implementation of intermediate readers more straightforward.
There is a user of fragment_and_freeze() (streaming) that will need
to be able to break the loop Right now, it does that between
streamed_mutation, but that won't be possible after we switch to flat
readers.
partition_entry::read() calls open_version() under standard allocator,
but it may allocate a new partition version if a snapshot already
exists which was created in an earlier phase. Versions are supposed to
be allocated using region's allocator, they will be freed using
region's allocator. LSA will delegate free() to the standard allocator
correctly in this case, but it will subtract from its
_non_lsa_occupancy, assuming the allocation was done through it. This
will corrupt occupancy() for cache region.
Fixes#2948.
Message-Id: <1510229584-14398-1-git-send-email-tgrabiec@scylladb.com>
"Ensure stop() waits for the accept loop to complete to avoid crashes
during shutdown."
* 'thrift-server-stop/v4' of https://github.com/duarten/scylla:
thrift/server: Restore code format
thrift/server: Stopping the server waits for connection shutdown
thrift/server: Abort listeners on stop()
thrift/server: Avoid manual memory management
thrift/server: Add move ctor for connection
thrift/server: Extract retry logic
thrift/server: Retry with backoff for some error types
thrift/server: Retry accept in case of error
This patch ensures the future returned from stop() resolves only when
all connections and listeners are no longer in use.
Fixes#2657Fixes#2942
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In case of errors like ECONNABORTED, we want to retry accepting
connections. Right now we immediately retry the accept, but in
subsequent patches we introduce a backoff for other types of errors.
We also consider fatal errors like EBADFD, which should not trigger a
retry.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Currently, we use type type of the column as the accumulator when we
average it. This can easily overflow, e.g. (2^31-1)+(3) = overflow.
Fix by using __int128 for the accumulator. It's not standard, but
it's way more efficient and simpler than the alternatives.
Inspired by CASSANDRA-12417, but much simpler due to the availability
of __int128.
Message-Id: <20171112173529.30764-1-avi@scylladb.com>
"This patch adds support for varint and decimal to aggregate functions.
Some other types (like byte or smallint) weren't supported and they are
supported by C*. So their aggregate functions were added as well.
To allow aggregate functions for big_decimal, following methods were added
to big_decimal type:
* Division by int64_t that preservers number of decimal digits.
* Operator += .
* Comparison operators.
Fixes #2842."
* 'danfiala/scylla-2842-send-002' of https://github.com/hagrid-the-developer/scylla:
tests: Add tests for aggregate functions.
tests: Add tests for big_decimal type.
cql3/functions: Add aggregate functions for big_decimal.
utils/big_decimal: Added necessary operators and methods for aggregate functions.
cql3/functions: Add aggregate functions for types for which it is trivial.
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).
However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box'
memory.
This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).
After that is done, we need to make sure that dirty memory is not seen
as freed until the cache update is done. Until a particular partition is
moved to the cache it is not evictable. As a result we can OOM the
system if we have a lot of pending cache updates as the writes will not
be throttled and memory won't be made available.
This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.
I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.
Fixes#1942
* git@github.com:glommer/scylla.git glommer/update-cache-v4:
row_cache: modernize use of seastar threads
mutation_partition: estimate size of partition
memtable: factor out calculation of memtable_entry memory size
memtable: add a method to export memtable's dirty memory manager
dirty_memory_manager: block if we hit the real dirty limit
dirty_memory_manager: add functions to manipulate real dirty
partition: add method to calculate memory size of a partition
row cache: pin real dirty during cache updates.
This patch fixes 'DROP INDEX' CQL statement to also drop the underlying
index view automatically so that we don't leave unused materialized
views behind.
Message-Id: <1510303421-15945-1-git-send-email-penberg@scylladb.com>
Right now, once a region is moved to the cache is no longer visible to
the dirty memory system. Not as real dirty nor virtual dirty.
The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.
This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.
I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.
Fixes#1942
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Once that is added, also add a method to a memtable entry to calculate
the entire size of a memtable entry. Right now we only have one method
to calculate the size minus rows.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we want to add and remove real dirty memory
without impacting virtual dirty. One such example is the cache update
process, where real dirty is the limiting factor.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).
However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box'
memory.
This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).
A next step is to add a controller that will increase the priority of
the tasks involving in releasing real dirty memory if we get dangerously
close to the threshold.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The total size is the sum of two components. Add a method that
does that sum so this code gets easier to reuse.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition in a point in which we
are not reading it. One such example is the cache update process.
This patch enhances the mutation_partition adding a method that returns
the total size for this partition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
For a while now we have an async() function, that simplifies the code by not
needing to issue an explicit join. This patch converts the row cache to use
async() as well, which most of our code already does. Doing so will make
it easier to make changes to update_cache.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"Implement flat_mutation_reader::consume and add tests for it.
For that implement flat_mutation_reader_from_mutations and
read_mutation_from_flat_mutation_reader."
* 'haaawk/flat_reader_consume_v3' of github.com:scylladb/seastar-dev:
Add tests for flat_mutation_reader::consume
Add tests for flat_mutation_reader utils
Introduce read_mutation_from_flat_mutation_reader
Make mutation_rebuilder streamed_mutation independent
flat_mutation_reader_from_mutation: support multiple mutations
Introduce flat_mutation_reader::consume
Move FlattenedConsumer concept to flat_mutation_reader.hh
"This patchset introduces memtable::make_flat_reader that returns
flat_mutation_reader and converts internal memtable readers into
flat_mutation_readers.
It also introduces some utility methods like make_forwardable and
make_partition_snapshot_flat_reader."
* 'haaawk/flat_reader_memtable_v4' of github.com:scylladb/seastar-dev:
Turn scanning_reader into flat_mutation_reader
Change memtable_entry::read to return flat_mutation_reader
Make iterator_reader independent from mutation_reader
Introduce make_partition_snapshot_flat_reader
Prepare partition_snapshot_flat_reader
Introduce flat_mutation_reader_from_mutation
Prepare flat_mutation_reader_from_mutation
Introduce make_forwardable
Prepare make_forwardable for flat_mutation_reader
Introduce empty_flat_reader
memtable: Introduce make_flat_reader
mutation_rebuilder will be used not only with streamed_mutations
but also with flat_mutation_readers so it's better for it to be
independent from streamed_mutation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Rename flat_mutation_reader_from_mutation to
flat_mutation_reader_from_mutations.
Make it work with std::vector<mutation> instead of a single
mutation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This concept will be used both in flat_mutation_reader.hh
and mutation_reader.hh. mutation_reader.hh includes
flat_mutation_reader.hh so we have to move the concept to
make it accessible in both files.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Make it possible for a mutation_source to be created both for sources
that use old mutation_reader and new flat_mutation_reader.
Add tests for flat_mutation_reader::next_partition to run_mutation_source_tests.
* seastar-dev.git 'dev/haaawk/flat_reader_mutation_source_v3':
Remove mutation_reader.hh dependency from flat_mutation_reader.hh
Prepare mutation_source for more than one implementation
Add flat reader mutation source implementation
Add mutation_source::make_flat_mutation_reader
Use mutation_source::make_flat_mutation_reader in tests
Add flat_mutation_reader_assertions
Add test for flat_mutation_reader::next_partition
This commit creates a copy of partition_snapshot_reader
and names it partition_snapshot_flat_reader.
This new class will be turned into a flat_mutation_reader
in the next commit.
The purpose of this commit is to make it easier to review the next
commit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This is a utility method that will be handy in conversion
from mutation_reader to flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This commit copies streamed_mutation_from_mutation
from streamed_mutation to flat_mutation_reader
and renames it to streamed_mutation_from_mutation_copy.
This copy will be used as a base for
flat_mutation_reader_from_mutation.
The purpose of this commit is to make it easier to review the next
commit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It will add the ability to fast_forward_to on position_range
to flat_mutation_reader that does not have this ability.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This commit copies make_forwardable from streamed_mutation
to flat_mutation_reader and renames it to make_forwardable_copy.
This copy will be used as a base for make_forwardable implementation
for flat_mutation_reader.
The purpose of this commit is to make it easier to review the next
commit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This method creates a flat_mutation_reader
instead of mutation_reader. All users will be gradually
converted to the new interface. make_reader is implemented
using make_flat_reader and will be removed once all users
are migrated.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will be used as an intermediate state of migration
from mutation_reader to flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
There will be a second implementation that will be used by
sources that are converted to flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It's not needed and causes cyclic dependency when we need
flat_mutation_reader in mutation_source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Change some test cases to not rely on them.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171107165138.3176-1-duarte@scylladb.com>
f44131226a introduced a regression where for some verbs we would
return partitions in their natural sort order, but since thrift
partition ranges can wrap-around, what we need to preserve is query
order.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171103201118.18175-1-duarte@scylladb.com>
Fixes#2938.
* 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev:
tests: range_tombstone_list: Add test for exception safety of apply()
tests: Introduce range_tombstone_list assertions
cache: Make range tombstone merging exception-safe
range_tombstone_list: Introduce apply_monotonically()
range_tombstone_list: Make reverter::erase() exception-safe
range_tombstone_list: Fix memory leaks in case of bad_alloc
mutation_partition: Fix abort in case range tombstone copying fails
managed_bytes: Declare copy constructor as allocation point
Integrate with allocation failure injection framework
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Disable the tests, but don't remove them because they
will be resurrected once CASSANDRA-13832 is fixed.
Message-Id: <1510052422-3478-1-git-send-email-penberg@scylladb.com>
range_tombstone_list::apply() has no exception safety guarantees about
the logical state. The target mutation_partition in cache should be
assumed to be left in unspecified state. In particular, some of the
preexisting overlapping tombstones may be removed and not reinserted,
so the cache would be missing some of the range tombstone information
in case the whole allocating section fails.
Use apply_monotonically() which provides the needed guarantees.
Fixes#2938.
erase_undo_op() constructor takes ownership of *it, and destroys it
when it goes out of scope. If emplace_back() fails, *it would be
destroyed before being removed from its container (_dst._tombstones).
Fix by making sure _ops.emplace_back() won't fail.
If exception is thrown from _row_tombstones.apply(), _rows will be
left uncleared. This will trigger assertion in bi::set_member_hook
destructor, which assrts that the hook is not linked.
Always clear _rows.
Because of the small size optimization, not all copies will call the
allocator, so allocation failure injection may miss this site if the
value is not large enough. Make the testing more effective by marking
this place explicitly as an allocation point.
* seastar d71922c...8040cab (4):
> util: Introduce support for allocation failure injection
> Adding dpdk-port-index as a command line option with default value of 0
> core/sharded: Introduce invoke_on_others()
> noncopyable_function: improve support for capturing mutable lambdas
"Fixes #2933
Fixes regressions introduced by config restructuring.
Allows "base" config to handle errors by warning, while
other uses can opt otherwise."
[pdziepak: resolved merge conflict]
* 'calle/cfgfix' of github.com:scylladb/seastar-dev:
config_test: Use error handler (ignore errors) + add error test
config: Resurrect command line aliases that where lost
main: Use error handler for config parse
config_file: Add optional "error_handler" to yaml parse functions
"This patch series adds support for secondary index queries using the
backing index view that's created when CREATE INDEX statement is
executed.
Example:
-- Create keyspace and table:
CREATE KEYSPACE ks WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
CREATE TABLE ks.users (
userid uuid,
name text,
email text,
country text,
PRIMARY KEY (userid)
);
-- Create secondary indexes:
CREATE INDEX ON ks.users (email);
CREATE INDEX ON ks.users (country);
-- Insert some data:
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Bondie Easseby', 'beassebyv@house.gov', 'France');
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Demetri Curror', 'dcurrorw@techcrunch.com', 'France');
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Langston Paulisch', 'lpaulischm@reverbnation.com', 'United States');
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Channa Devote', 'cdevote14@marriott.com', 'Denmark');
-- Query on the secondary-index backed non-primary keys:
SELECT * FROM ks.users WHERE email = 'beassebyv@house.gov';
userid | country | email | name
--------+---------+-------+------
022238c8-5213-44b5-959e-4e3e1b032f85 | France | beassebyv@house.gov | Bondie Easseby
(1 rows)
SELECT * FROM ks.users WHERE country = 'France';
userid | country | email | name
--------------------------------------+---------+-------------------------+----------------
2152d85a-61f6-4eab-af4d-e7e7d0872319 | France | beassebyv@house.gov | Bondie Easseby
59fddb6d-bfc9-4636-a9a0-85383fd815ee | France | dcurrorw@techcrunch.com | Demetri Curror
Known imitations:
- Only regular column indexes return results. Indexing primary key
components like clustering keys return empty result set because of
index view query partition key serialization issues that will be fixed
in subsequent patches.
- Secondary index queries are not paginated, which can cause problems
for queries that return a large number of rows.
- Multiple restrictions don't work correctly if one of them is backed by
a secondary-index.
- Only one secondary-indexed restriction per query is supported -- other
restrictions are ignored.
- Compound partition keys are not supported.
- ALLOW FILTERING on non-primary key columns does not work correctly
without secondary index (see issue #2200)."
* 'penberg/cql-2i-queries/v2' of github.com:penberg/scylla:
tests/cql_query_test: Add test case for secondary index queries
cql3: Secondary-index backed select statements
index: Fix index view schema when primary key component is indexed
tests/cql_query_test: Fix view creation in test_duration_restrictions()
cql3/restrictions: Add statement_restrictions::index_restrictions() helper
index: Implement index::supports_expression() for EQ operator
cql3: Make operator_type class non-copyable
index: Fix index::supports_expression() operator parameter type
cql3: Implement statement_restriction index validation
Before the patch we appended and queried at the front. Insert at the
front instead, so that writes and reads overlap. Stresses eviction and
population more.
Message-Id: <1506369562-14892-1-git-send-email-tgrabiec@scylladb.com>
"We currently can't insert row entries at any position_in_partition,
but only at full keys and after all keys. If a query range has bounds
such that we have to insert a dummy entry at non-representable position
then information about range continuity will not be fully populated.
In particular, single-row queries of a row which is not present in sstables
will miss when repeated again.
The series fixes the problem by marking the whole query range as continuous
by inserting dummy entries at boundaries when necessary.
Refs #2579."
* tag 'tgrabiec/cache-range-continuity-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add test for population of single rows
tests: Add test for population of continuity
tests: mutation_reader_assertions: Introduce produces_compacted()
mutation: Introduce apply(mutation_fragment)
cache: Document invariants of cache_streamed_mutation::_lower_bound
cache_streamed_mutation: Special-case population for singular ranges
query: Introduce is_single_row()
cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction
cache_streamed_mutation: Override continuity of older versions when populating
cache_streamed_mutation: Mark whole query range as continuous
tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition
cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows
cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases
row_cache: Make read_context::key() valid before reading from underlying starts
mutation_partition: Allow creating rows_entry at any clustered position_in_partition
position_in_partition: Do not use -2 and +2 weights
clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range
mutation_partition: Remove delegating_compare()
mvcc: Print iterators in operator<< for partition_snapshot_row_cursor
mvcc: Introduce partition_snapshot_row_weakref
mvcc: Make the null state of partition_snapshot::change_mark explicit
mvcc: Add partition_snapshot::region() getter
mvcc: Add partition_snapshot::schema() getter
position_in_partition: Introduce before_key()
position_in_partition: Introduce min()
position_in_partition: Introduce for_static_row()
While not being a real unit tests memory_footprint can be a quite useful
tool and running it among other tests will ensure that we will notice
when it gets broken.
Message-Id: <20171102160233.6756-2-pdziepak@scylladb.com>
When created cache registers several metrics, since attempts to create
an already existing metrics result in an exception being thrown it is no
longer possible to have two cache instances at the same time. This is
exactly what happens in memory_footprint: one (useless) cache object is
created through a call to do_with_cql_env() and, then, memory_footprint
explicitly creates another one (not a useless one).
The tests itself doesn't really need a full cql environment and the only
reason it was added is so that storage_service is initialised and various
code paths can query for the available cluster features. This can be
done in a much lightweight way using storage_service_for_tests.
Fixes memory_footprint failure (until next time we decide there is
nothing wrong with globals).
Message-Id: <20171102160233.6756-1-pdziepak@scylladb.com>
$basearch isn't parsed as expected, the finaly baseurl is wrong.
We only have x86_64 arch in external 3rdparty repository, and
the conf file is only for x86_64, so it's fine to use hardcode
x86_64.
The problem was introduced by commit
b5e83ebd94 ("dist/redhat: switch 3rdparty
packages to external build service").
Fixes#2930
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <708f46a7c36623e86fee278462c80db1eff3b820.1509700430.git.amos@scylladb.com>
This patch adds support for secondary-index backed select statements.
Current select_statement class is split into two separate classes:
primary_key_select_statement that retains regular query behavior and
indexed_table_select_statement that introduces the new secondary-index
backed query logic. One of the two behaviors is selected at query
preparation time to minimize overhead for non-indexed queries.
This fixes index view schema to exclude indexed column when a primary
key component like clustering key is indexed. This fixes a server crash
when CREATE INDEX statement is executed on a clustering key column.
The materialized view created in test_duration_restriction() restricts
on a non-PK column. Since Scylla's ALLOW FILTERING and secondary index
validation path is broken, once we start to do secondary index queries,
query processor thinks there's a secondary index backing that non-PK
column and fails because it's unable to find such column.
Fix up the view to only trigger the duration type validation error we're
interested in here.
Fixes the case of continuity not being populated when the row which is
the upper bound of the population range belongs to a non-latest
version. In such case we wouldn't mark the range as continuous,
because we can't modify rows of non-latest versions. To fix this,
create an empty entry in latest version which will just override the
continuity flag of the old entry.
Before this patch only ranges between returned row fragments were
marked as continuous. In the extreme case, there could be no such
fragments, in which case next read would miss as well. To avoid this,
mark whole query range as continuous by inserting dummy entries when
necessary.
Refs #2579.
Current code will not mark the range as continuous if the previous
entry does not come in the latest version. Fix that by switching
to partition_snapshot_row_pointer, which is capable of checking
in older versions as necessary.
Also, we avoid the key comparison if we know that the iterator
is still valid.
So that we can call cache_streamed_mutation::can_populate() before
we start reading from underlying. Will be needed in upcoming changes
which insert dummy entries when falling back to underlying.
::weight() is using those values for excl_end and excl_start in order
to be able to represent non-overlapping ranges. In their model the end
bound is inclusive. We don't need this, since position_range has end
bound exclusive.
This change makes that:
position_in_partition::after_key(y)
== position_in_partition::for_range_end(clutering_range::make({x}, {y})
This patch changes the way the multiget_{slice, count} verbs return
their results, by ensuring a queried key that produced no results is
still present in the returned map, associated with an empty list.
This is not required by the thrift interface, and it is a performance
step back, but matches the behavior of Apache Cassandra.
Said behavior is relied upon by projects like JanusGraph, whose
integration with Scylla motivated this patch.
Fixes#2900
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-2-duarte@scylladb.com>
Most common operations, like multiget_count and multiget_slice, return
maps. So, instead of keeping a vector internally in column_visitor
that we later transform into a map, keep a map that we transform into
a vector for the uncommon operations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-1-duarte@scylladb.com>
"This series cleans-up the query processor header and source file,
including deleting dead Java code.
There are no functional or interface changes.
I've run all unit tests and observed no failures."
* 'jhk/clean_up_qp/v2' of github.com:hakuch/scylla:
cql3/query_processor: Fix formatting
cql3/query_processor: Organize headers
cql3/query_processor.hh: Consolidate `public` and `private` sections
cql3/query_processor: Remove dead Java code
* seastar 8babd1f...d71922c (11):
> configure.py: add -Wno-sign-compare to compile Boost.Test with gcc-7
> log: Print nested exceptions
> reactor: do not account non idle activity for total idle time calculation
> execution_stage: defer execution less aggressively
> Fix -Wreturn-type warnings
> cpu scheduler: make _reciprocal_shares_times_2_32 wider to avoid overflow problems
> noncopyable_function add bool operator
> execution_stage: make make_execution_stage return a named type
> memory: support overriding the default allocator page size
> memory: fix crash during startup with large page_size
> core: io_destroy is missing when destructing reactor, which causes io_context leak
We are failing to build .deb package on pbuilder due to lack of build time
dependencies so we need add those packages on Build-Depends, also we need to
follow Debian packaging style for the package contains python scripts.
Fixes#2918
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457215-11552-1-git-send-email-syuu@scylladb.com>
Switch to g++-7/boost-1.63 for Ubuntu 14.04/16.04 that newly provided via
our 3rdparty PPA.
To make Scylla compilable with boost-1.63/g++-7, we need to disable following
warnings:
- misleading-indentation
- overflow
- noexcept-type
Compile error message:
https://gist.github.com/syuu1228/96acc640c56c3316df5ce6911d60beea
Seastar also has similar problem, it needs to disable 'sign-compare', detail
is in a patch for Seastar.
This update also fixes current Ubuntu 14.04/16.04 compilation error problem,
since errors were come from too old g++/boost.
Fixes#2902Fixes#2903
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457509-13122-1-git-send-email-syuu@scylladb.com>
"When reading a single row it is possible that the read will be satisfied
by just reading from one of the data source candidates. To exploit this
an optimization is employed which sorts data source candidates by their
timestamp and reads mutations from the most recent to the oldest. When
all needed cells are present and their earliest timestamp is still
later than the latest one of the remaining data source the read can be
terminated early.
However this optimization also has the possibility to backfire as the
data sources are read sequentially, so if all of them has to be read
eventually then we will end up worse then without it.
Thus the optimization can be disabled up-front or enabled to only run
until its efficiency degrades below a certain threshold.
Also counters are added to column-families to make it possible to
observe how well it performs.
Benchmarking
Benchmarking was done with disabled cache and at a constant op rate of
4k (1/3 of the max op rate on my box), against 3 sstables containing the
same 10000 rows.
1) Optimization turned off (all sstables read paralelly)
latency mean : 1.3 [simple:1.3]
latency median : 1.0 [simple:1.0]
latency 95th percentile : 2.4 [simple:2.4]
latency 99th percentile : 2.9 [simple:2.9]
latency 99.9th percentile : 8.0 [simple:8.0]
latency max : 13.5 [simple:13.5]
2) Optimization turned on, best case (1 of 3 sstables read)
latency mean : 0.6 [simple:0.6]
latency median : 0.6 [simple:0.6]
latency 95th percentile : 1.0 [simple:1.0]
latency 99th percentile : 1.2 [simple:1.2]
latency 99.9th percentile : 4.4 [simple:4.4]
latency max : 13.4 [simple:13.4]
3) Optimization turned on, best case, IN query (1 of 3 sstables read)
latency mean : 0.7 [simple_in:0.7]
latency median : 0.6 [simple_in:0.6]
latency 95th percentile : 1.1 [simple_in:1.1]
latency 99th percentile : 1.4 [simple_in:1.4]
latency 99.9th percentile : 5.4 [simple_in:5.4]
latency max : 16.8 [simple_in:16.8]
4) Optimization turned on, worst case (3 of 3 sstables read sequentally)
latency mean : 2.8 [simple:2.8]
latency median : 2.3 [simple:2.3]
latency 95th percentile : 5.4 [simple:5.4]
latency 99th percentile : 6.5 [simple:6.5]
latency 99.9th percentile : 13.5 [simple:13.5]
latency max : 19.2 [simple:19.2]
5) Optimization turned on, mid case (2 of 3 sstables read sequentally)
latency mean : 1.4 [simple:1.4]
latency median : 1.1 [simple:1.1]
latency 95th percentile : 2.7 [simple:2.7]
latency 99th percentile : 3.2 [simple:3.2]
latency 99.9th percentile : 7.7 [simple:7.7]
latency max : 15.1 [simple:15.1]"
Ref #324
* 'bdenes/optimize_single_row_read_v6' of github.com:denesb/scylla:
Add unit tests for single_key_sstable_reader
Add counters for the single-key reader optimization
Add single_key_parallel_scan_threshold option
single_key_sstable_reader: optimize single-row queries
single_key_sstable_reader: move reading code into it's own method
Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns()
Add two counters, one to determine how many of the reads fall into the
optimization, and a second one to determine it's effectiveness.
The first one is single_key_reader_optimization_hit_rate. It contains
the rate of reads that the optimization applies to out of all the reads
that go into the single_key_sstable_reader.
The second one, single_key_reader_optimization_extra_read_proportion is
a histogram of the efficiency of the optimization. It contains the
proportion of extra data-sources read. It's a number between 0 and 1,
where 0 is the best case (only one data-source was read) and 1 is the
worst case (all data-sources were read eventually). This is the same
number that is used for the threshold option (see previous patch).
Each of the histogram's buckets cover a chunk of 0.1 from the [0, 1]
range.
Note that single_key_parallel_scan_threshold effectively provides an
upper bound for the proportion as the optimization is turned off as soon
as it goes above that number.
The counters are disabled if single_key_parallel_scan_threshold is set
to 0 disabling the optimization entirely.
This option regulates when exactly the single-key optimization is
considered ineffective and turned off.
The threshold is the proportion of the extra data source candidates that
can be read before the optimization is considered ineffective and
disabled. The proportion is calculated as follows:
(read_data_sources - 1) / (total_data_sources - 1)
We substract 1 from the read_data_sources and total_data_sources to
effectively measure the rate of *extra* data sources we read. This
makes sure that the proportion is meaningful even if e.g. we have only
have a total of 2 data-sources and we read only 1 (best case).
Whenever this number goes above the threshold the optimization is
disabled. The threshold is number between 0 and 1, 0 forces the
optimization off and 1 forces it on. Increase the treshold to favor
throughput over latency for single-row reads, decrease the treshold to
improve latency at the expense of throughput.
If the threshold is > 0 (it's not force disabled) and the optimization
is disabled due to a read crossing the threshold, we will issue
"probing" reads (every 100th read) to determine if the optimization is
worth re-enabling. Probing reads are allowed to run through the
optimization path and if they go below the threshold the optimization is
re-enabled.
For single-row queries that only query atomic cells one can put a lower
bound on the timestamps which may affect the query results and thus rule
out entire data sources. This allows the query to read only those
sstables that actually contribute to the result.
To do this we incrementally move through the sstables overlapping with
the query range, checking after each read mutation whether we already
have a value for all required cells and whether the lower-bound of their
timestamps is higher than the upper-bound of the timestamps of all the
remaining data-sources. When this condition is met we terminate the
read.
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having up to date view on the
ring, and serve incorrect data.
Fixes #2855."
* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
gms/gossiper: Remove periodic replication of endpoint state map
gossiper: Check for features in the change listener
gms/gossiper: Replicate changes incrementally to other shards
gms/gossiper: Document validity of endpoint_state properties
storage_service: Update token_metadata after changing endpoint_state
gms/gossiper: Process endpoints in parallel
gms/gossiper: Serialize state changes and notifications for given node
utils/loading_shared_values: Allow Loader to return non-future result
gms/gossiper: Encapsulate lookup of endpoint_state
storage_service: Batch token metadata and endpoint state replication
utils/serialized_action: Introduce trigger_later()
gossiper: Add and improve logging
gms/gossiper: Don't fire change listeners when there is no change
gms/gossiper: Allow parallel apply_state_locally()
gms/gossiper: Avoid copies in endpoint_state::add_application_state()
gms/failure_detector: Ignore short update intervals
"Extracts the yaml/boost-po aspects of the "self-describing" db::config
into an abstract type.
db::config is then reimplemented in said type, removing some of the
slightly cumbersome entanglement with seastar opts (log).
Adds a main hook for additional configuration files (options + file)"
* 'calle/config' of github.com:scylladb/seastar-dev:
main/init: Add registerable configuration objects
db::config: Re-implement on utils/config_file.
utils::config_file: Abstract out config file to external type
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
There is a requirement that whatever is present in token_metadata,
should also be present in endpoint_state. Because of that, we should update
endpoint_state first (set_gossip_tokens).
Apache Cassandra switched to this order as well in commit
b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.
Makes state application faster due to increased parallelism.
Refs #2855.
Bootrap of 11th node, ignoring apply_state_locally() which complete instantly:
Before:
DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms
After:
DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
It's possible that a change listener for a later state will run before
change listener for the previous state completes, in which case
node's state can be corruped. For example, the previous change listener
may override sysytem.peers with an old value.
This patch fixes the problem by serializing state changes and
listeners for each node.
The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropraite, because entries can be removed from
it while listeners are still running. There is code in the gossiper
which anticipates that entry may be gone across deferring points in
some places.
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal(). We
should avoid calling them more than necessary.
Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if exception is thrown in the middle of state
application, change listeners would not be run. Later we would not
detect the change for states which were already applied, and not run
the change listers.
Fixes#2867
It is serialized since e428d06f40. This causes regression in
performance of application state propagation due to reduced
parallelism.
Processing states for each node has high latency due to memtable
flushes triggered by update_tokens() and commitlog syncs done by
system.peers updates, if commitlog sync mode is set to "batch". We
have high internal concurrency for these, so increasing parallelism
significantly reduces time to process all states.
Fixes#2855.
Failure detector decides that a node is down if it hasn't received a change of
its heartbeat for longer than ~11 times the average of past intervals between
updates.
If there are multiple incoming ACKs containing information about the
same node, we may detect and report a change for each of them. This
will cause failure_detector to establish that the average report
period is in milliseconds. After the update storm is over, it will
claim the node failure very soon, because report period will now be a
large multiple of the average.
Fix by not counting short updates into the calculation of average
arrival time.
Fixes#2861.
Handling all the boost::commandline + YAML stuff.
This patch only provides an external version of these functions,
it does not modify the db::config object. That is for a follow-up
patch.
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that."
Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>
* 'prometheus-histograms' of github.com:glommer/scylla:
storage_proxy: change reporting of estimated histograms
estimated_histogram: bring histogram closer to what prometheus expects.
We moved to new dependency package names like
antlr3-c++-dev to scylla-antlr35-c++-dev when we moved to ppa on Ubuntu, but
Debian still uses old dist/debian/dep packages.
So keep using old style package names.
Fixes#2831
Message-Id: <1508245175-2184-1-git-send-email-syuu@scylladb.com>
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.
Refs #2885
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
This patch introduces schema::full_slice(), which returns a
partition_slice selecting the full clustering range, as well as all
static and regular columns. No options aside from the default are
set in that partition_slice.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-1-git-send-email-duarte@scylladb.com>
Fixes: 2898
Typo error in gensalt(). Only returned selected hash method, not
the random salt bytes. Does not prevent the hash function from
operating, but strength is ever so reduced.
Message-Id: <20171016130505.25593-2-calle@scylladb.com>
"This changeset is the first step to flatten mutation_reader.
Then it introduces new mutation_fragment types for partition header and end of partition.
Using those a new flat_mutation_reader is defined.
Finally it introduces converters between new flat_mutation_reader and
old mutation_reader."
* 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev:
Add tests for flat_mutation_reader
Introduce conversion from flat_mutation_reader to mutation_reader
Introduce conversion from mutation_reader to flat_mutation_reader
Introduce flat_mutation_reader
Extract FlattenedConsumer concept using GCC6_CONCEPT
Introduce partition_end mutation_fragment
Introduce a position for end of partition
Introduce partition_start mutation_fragment
Introduce FragmentConsumer
Introduce a position for partition start
streamed_mutation: Extract concepts using GCC6_CONCEPT macro
The command scans random set of objects in a small pool (or, optionally
only objects of a certain size) for vptrs and builds a histogram, so that
most often used vptrs can be easily found. The command is useful to find
"memory leaks" caused by creating of too many tasks of a certain type
which is usually a result of unlimited parallelism somewhere.
Message-Id: <20171015081634.GB21092@scylladb.com>
Those tests run mutation source test for all sources
using conversion to and from flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This reader operates on mutation_fragments instead of
streamed_mutations.
Each partition starts with a partition_header fragment
and ends with end_of_partition fragment.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This fixes a regression introduced in 27a3b4bca9 (master only).
partition_range_cursor assumes that as long as references are valid,
_end is valid as well. But if new entries were inserted before _end,
it may not, if the new entries fall after the query range. This may
result in reads returning partitions from outside the query range.
Message-Id: <1507815478-20269-1-git-send-email-tgrabiec@scylladb.com>
"Based on the functions get_endpoint_state_for_endpoint_ptr(),
get_application_state_ptr() and
endpoint_state::get_application_state_ptr(), this series
cleanups miscelaneous functions related to the gossiper.
It not only removes duplicated code, but also omits many copies.
All pointer usages have been audited for safety."
Acked-by: Asias He <asias@scylladb.com>
Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com>
* 'gossiper-cleanup/v2' of github.com:duarten/scylla: (27 commits)
gms/endpoint_state: Remove get_application_state()
service/storage_service: Avoid copies in prepare_replacement_info()
service/storage_service: Cleanup get_application_state_value()
service/storage_service: Cleanup handle_state_removing()
service/storage_service: Cleanup get_rpc_address()
locator/reconnectable_snitch_helper: Avoid versioned_value copies
locator/production_snitch_base: Cleanup get_endpoint_info()
service/migration_manager: Avoid copies in is_ready_for_bootstrap()
service/migration_manager: Cleanup has_compatible_schema_tables_version()
service/migration_manager: Fix usages of get_application_state()
cache_hit_rate: Avoid copies in get_hit_rate()
gms/endpoint_state: Avoid copies in is_shutdown()
service/load_broadcaster: Avoid copy in on_join()
gms/gossiper: Cleanup get_supported_features()
gms/gossiper: Cleanup get_gossip_status()
gms/gossiper: Cleanup seen_any_seed()
gms/gossiper: Cleanup get_host_id()
gms/gossiper: Removed dead uses_vnodes() function
gms/gossiper: Cleanup uses_host_id()
gms/gossiper: Add get_application_state_ptr()
...
We were taking a reference to a temporary value in different places.
Fix them by using get_application_state_ptr(), which also avoids a copy.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the get_application_state_ptr() function, which
allows access to a versioned_value of a particular endpoint.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that we have get_endpoint_state_for_endpoint_ptr(), which does not
return a copy and allows mutating the actual state, we can use it
instead of repeating the lookup code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify
logging, and also protect expensive operations by checking the log
level.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
For every finished compaction, we were calculating shards for all
existing tables. With ignore_msb set to 0, it's probably not a big
deal, but if ignore_msb is like 12 and LCS is used (meaning thousands
of tables possibly), the operation may stall the reactor for a
considerable amount of time. That's fixed by caching shards.
Fixes#2875.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171011053424.22308-1-raphaelsc@scylladb.com>
"gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive. This
series introduces a function that returns a pointer and avoids
the copy.
Fixes#764"
* 'endpoint-state/v2' of https://github.com/duarten/scylla:
gossiper: Avoid endpoint_state copies
endpoint_state: const-qualify functions
storage_service: Remove duplicate endpoint state check
This type of mutation_fragment will be used in new mutation_reader
to signal the end of the current partition.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This position will be used for mutation fragment
that represents the end of partition.
This position sorts after all other mutation fragments.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This type of mutation_fragment will be used in new mutation_reader
to signal the beginning of the next partition.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
Fixes#764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Seastar small pools can now fall back to smaller spans. Adjust
the 'scylla memory' command accordingly.
Message-Id: <20171005123935.13503-1-avi@scylladb.com>
* seastar c62bbf9...8babd1f (9):
> Enhanced support for Travis CI: build with and without DPDK support, use varioius compilers (GCC 5/6/7)
> backtrace: Allow whitespace after the backtrace addresses
> test.py: fix typo in noncopyable_function_test
> utils: introduce noncopyable_function
> Revert "utils: introduce noncopyable_function"
> utils: introduce noncopyable_function
> Add seastar-addr2line helper script to decode backtraces
> execution_stage: pass scheduling_group to constructor
> reactor: preempt tasks when a signal is received
"This series implements CAST AS functions in scylla.
It allows to use expressions of the form CAST(x AS type) in select statements.
Primary motivation for this functions came from aggregate functions, because
function avg(.) gives rounded results for interger columns. Now it is possible
to convert such column to float/double and obtain floating point results:
SELECT ... avg(cast(x as double)), ...
Fixes #2280."
* 'danfiala/2280-patch-series-v2' of https://github.com/hagrid-the-developer/scylla:
tests: Add test for CAST AS functions.
cql3: Add support for CAST AS functions to ANTLR grammar.
cql3/selectable: Add selectable::with_cast for CAST AS functions.
cql3/functions: Add support for CAST AS functions.
types:: Add support for CAST AS functions.
types: Moved code that implements conversion of types' values to string.
"This patch series adds backing materialized view for secondary indices.
When a new index is created with the 'CREATE INDEX' statement, a backing
materialized view is created automatically.
For example, assuming the following table:
CREATE TABLE ks1.users (
userid uuid,
email text,
PRIMARY KEY (userid)
);
When the following index is created:
CREATE INDEX user_email ON ks1.users (email);
The following materialized view is also created:
cqlsh> DESCRIBE ks1.users;
<snip>
CREATE MATERIALIZED VIEW ks1.user_email_index AS
SELECT email, userid
FROM ks1.users
WHERE email IS NOT NULL
PRIMARY KEY (email, userid)
WITH CLUSTERING ORDER BY (userid ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CQL queries will use the backing materialized view as part of queries on
indexed columns to fetch the primary keys."
* 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev:
schema_tables: Create backing view for indices
database: Kill obsolete secondary index manager stub
cql3: Wire up secondary index manager
cql3/restrictions: Add term_slice::is_supported_by() function
index: Add secondary_index_manager::create_view_for_index()
index: Add target_parser::parse() helper
cql3/statements: Add index_target::from_sstring() helper
index: Add secondary_index_manager::get_dependent_indices()
index: Add secondary_index_manager::reload()
index: Add secondary_index_manager::list_indexes()
index: Add index class
index: Pass column_family to secondary_index_manager constructor
database: Make secondary index manager per-column family
This patch wires calls to secondary index manager reload() in
merge_tables_and_views() and changes make_update_indices_mutations() to
also create mutations for the backing materialized view. After this
patch, "CREATE INDEX" CQL statement also creates a materialized view.
We are currently collapsing the histograms in 16 points, exponentially
increasing in value, starting from 1.
While reducing the number of points is a worthy goal, the current
configuration caps us at 4ms. Our latencies tend to be higher than this.
Starting from 1 is also a bit of an exhaggeration: rarely are our
latencies in that range. This patch changes reporting so that we
report 20 points starting from 32.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that.
This patch changes the histogram format to be more in line with what
prometheus expect.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The call to std::ref() is not namespace-qualified, and so can conflict
with seastar::ref().
Fix by naming std::ref() explicitly.
Message-Id: <20171004155250.4960-1-avi@scylladb.com>
"Makes authorizer/authenticator actually pluggable (by class name)
and adds a "Transitional" type for both, conforming to the DSE
definition of the types.
The idea is to allow a rolling upgrade of a cluster to
authentication op by first making all clients provide credentials
(ignored by non-auth), then node by node enable auth with
transitional handlers, then ensure user DB is populated and
distributed, and finally rollingly enable strict auth for
each node. Pfew."
Fixes#2836
* auth: Transitional auth wrappers
auth: Make authenticator/authorizer use actual name based lookup
Similar to DSE objects with similar name. Basically ignores
all authentication/authorization except "superuser" login. All others
sessions are treated as anonymous.
Note: like DSE counterparts, a client session must still _use_
authentication to be able to connect, even though the actual content of
the auth is mostly ignored.
Allows registry to give back, for example, shared_ptr etc instead of
solely unique_ptr. If a registry is defined with seastar/std
shared/lw_shared/unique_ptr as "BaseType", the type will assume
this is the intended result type.
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed.
Fixes #2692."
* 'bdenes/restricting_mutation_reader-v5' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
"
The original motivation for the "utils: introduce a loading_shared_values" series was a hinted handoff work where
I needed an on-demand asynchronously loading key-value container (a replica address to a commitlog instance map).
It turned out that we already have the classes that do almost what I needed:
- utils::loading_cache
- sstables::shared_index_lists
Therefore it made sense to find a common ground, unify this functionality and reuse the code both in the classes above and in the
new hinted handoff code.
This series introduces the utils::loading_shared_values that generalizes the sstables::shared_index_lists
API on top of bi::unordered_set with the rehashing logic from the utils::loading_cache triggered by an addition
of an entry to the set (PATCH1).
Then it reworks the sstables::shared_index_lists and utils::loading_cache on top of the new class (PATCH2 and PATCH3).
PATCH4 optimizes the loading_cache for the long timer period use case.
But then we have discovered that we have another "customer" for the loading_cache. Apparently our prepared statements cache
had a birth flaw - it was unlimited in size - unless the corresponding keyspace and/or table are modified/dropped the entries
are never evicted. We clearly need to limit its size and it would also make sense to evict the cache entries that haven't been
used long enough.
This seems like a perfect match for a utils::loading_cache except for prepared statements don't need to be reloaded after
they are created.
Patches starting from PATCH5 are dealing with adding the utils::loading_cache the missing functionality (like making the "reloading"
conditional and adding the synchronous methods like find(key)) and then transitioning the CQL and Thrift prepared statements
caches to utils::loading_cache.
This also fixes #2474."
* 'evict_unused_prepared-v5' of https://github.com/vladzcloudius/scylla:
tests: loading_cache_test: initial commit
cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
cql3: prepared statements cache on top of loading_cache
utils::loading_cache: make the size limitation more strict
utils::loading_cache: added static_asserts for checking the callbacks signatures
utils::loading_cache: add a bunch of standard synchronous methods
utils::loading_cache: add the ability to create a cache that would not reload the values
utils::loading_cache: add the ability to work with not-copy-constructable values
utils::loading_cache: add EntrySize template parameter
utils::loading_cache: rework on top of utils::loading_shared_values
sstables::shared_index_list: use utils::loading_shared_values
utils: introduce loading_shared_values
Update description of existing reader count metrics, add memory
consumption metrics. Use labels to distinguish between system, user and
streaming reads related metrics.
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed."
Fixes#2692.
* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
* seastar 899fc4e...c62bbf9 (6):
> Merge "CPU Scheduler for seastar" from Avi
> reactor: set SCHED_FIFO policy for timer thread
> future: mark future::wait() as noexcept
> shared_promise: Make get_shared_future() const-qualified
> Remove pessimizing and redundant std::move()-s reported by Clang-tidy utility
> Work around GCC 5 bug: scylladb/seastar#338, scylladb/seastar#339
Fixes#2770.
Fixes#2819.
* seastar 92fdce2...899fc4e (14):
> scollectd: increment the metadata iterator with the values
> Enable Travis CI builds for Seastar.
> tests: Fix httpd test compilation error caused by unconditionally explicit tuple constructor in GCC5: scylladb/seastar#326
> core::shared_future: add available() and failed() methods
> rpc: make sure that _write_buf stream is always properly closed
> log: Fail on attempt to register logger with the same name twice
> Merge "Make backtraces useful on ASLR-enabled machines as well" from Botond
> reactor: add option to bypass fsync
> future-util: modernize do_until() implementation
> future-util: fix do_until() API to not have forwarding references
> input_stream: add rvalue variant of input_stream::consume()
> logger: remove extra spaces after timestamp
> tutorial: lifetime management
> Fix broken link for fsqual failure message
We don't pull schema during rolling upgrade, that is until
schema_tables_v3 feature is enabled on all nodes.
Because features are enabled from gossiper timer, there is a race
between feature enablement and processing of endpoint states which may
trigger schema pull. It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.
The fix is to ensure that pulls abandoned due to feature not being enabled
will be retried when it is enabled.
Fixes sporadic failure in dtest:
repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>
Dirty memory manager for non-system column families was being used
when applying mutations to system cfs.
That previously lead to deadlock when updating history. Basically,
write disable waits on compaction, and compaction waits on a write
that would release dirty memory for updating compaction history.
Only using the correct dirty manager wouldn't solve this problem
if write is disabled for system cf, but the problem is completely
solved in addition to previous change which updates history
outside the sstable lock.
Refs #2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>
The reason to do that is because compaction can deadlock if refresh
disables write which waits for compaction, and compaction in turn
waits for dirty memory[1] that would be released by memtable write.
Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.
[1]: when updating compaction history.
Fixes#2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.
Fixes #2834."
* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
tests: mvcc: Add test for partition_snapshot_row_cursor
tests: row_cache: Add test for concurrent population
tests: row_cache: Make populate_range() accept partition_range
tests: Add simple_schema::make_ckey_range()
cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
mvcc: Keep track of all iterators in partition_snapshot_row_cursor
mvcc: Make partition_snapshot_row_cursor printable
Make get_natural_endpoints return local address iff token metadata
is not yet setup (since that is the one address we already know of).
If a request has a consistency level requiring more endpoints, it
will still fail, but for calls with, for example, CL=ONE, at startup
we will succeed, and more or less act like local strategy. Yet,
further down the line, have data distributed as desired.
Acked-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20170926113512.15707-1-calle@scylladb.com>
We were checking if the cursor is up_to_date(), but this is not enough
to guarantee that the cursor is valid, merely that its iterators are
valid. The cursor may be invalidated even if its iterators are valid
if there was an insertion after cursor's position.
Fixes#2834.
The cursor maintains a heap of iterators in all versions. If rows were
inserted before the latest version's iterator, cursor would not see
them. Fix by redoing the lookup for iterators not in the current row
in maybe_refresh().
Refs #2834.
Will be needed when updating the iterator for latest version. Before
this change, such iterator could be neither in _current_row nor in
_heap.
Besides that, this will allow user to always access the iterator of
latest version, which enables some optimizations in the future of
avoiding unnecessary lookups. get_iterator_in_latest_version() is now
always valid.
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registerred, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on schema sync which will not happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>
This position will be used for mutation fragment
that represents the start of a partition.
This position sorts before static row.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It makes it easier to actually use those concepts.
Lambdas passed to mutation_fragment::visit have to declare
return type otherwise compiler fails with:
internal compiler error: Segmentation fault
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The gossiper checks if features should be enabled from its timer
callback when it detects that endpoint_state_map changed, that is
different than shadow_endpoint_state_map.
shadow_endpoint_state_map is also assigned from endpoint_state_map in
storage_service::replicate_tm_and_ep_map(), called from
storage_service::on_change()
Call gossiper:maybe_enable_features() in replicate_tm_and_ep_map so
that we won't miss gossip feature update.
Fixes#2824
* git@github.com:scylladb/seastar-dev asias/gossip_miss_feature_update_v1:
gossip: Move the _features_condvar signal code to
maybe_enable_features
gossip: Make maybe_enable_features public
storage_service: Check gossip feature update in
replicate_tm_and_ep_map
This is another place we can update endpoint_state_map in addition to
gossiper::run().
Call the gossiper:maybe_enable_features() so that we won't miss gossip
feature update.
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
It skipped one sub-range in each of the 10 range batch, and
tried to access the range vector using end() iterator.
Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.
Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
"On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.
We shouldn't call fast_forward_to() with the cache region locked.
Fixes #2791."
* 'tgrabiec/dont-ffwd-in-alloc-section' of github.com:scylladb/seastar-dev:
cache_streamed_mutation: De-futurize cursor movement
cache_streamed_mutation: Call fast_forward_to() outside allocating section
cache_streamed_mutation: Switch from flags to explicit state machine
If we read a partition from a single sstable (a fairly common case),
we can bypass mutation_merger and just return the input.
Message-Id: <20170918181418.14021-1-avi@scylladb.com>
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before whole state is processed.
Apache (tm) Cassandra (tm) sends echos in the background.
In a large cluster, we see at the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet, which
results in the joining node ingores to stream from some of the existing
nodes.
Fixes#2787Fixes#2797
Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
When incremental_reader_selector is used for compaction, it will
first call incremental selector of partitioned sstable set with
minimum token that will result in first interval being skipped,
which means not everything being compacted. The interval is
skipped because iterator is incorrectly advanced when token
lies before it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918021446.15920-1-raphaelsc@scylladb.com>
The current sequential scan can take a long time on a small or empty table
with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to
exponential scan reduces the time.
Fixes#1230.
Message-Id: <20170912173803.8277-1-avi@scylladb.com>
Forward-declare untyped_result_set and untyped_result_set_row, and remove
the include from query_processor.hh.
Message-Id: <20170916170859.27612-3-avi@scylladb.com>
"sstables.hh is already too big, and it is soon to become bigger with the
inclusion of the read_monitor, to pair it with the write_monitor.
It's a good opportunity for us to reduce sstables.hh dependencies by
moving the write monitor to its own reader. One obvious caller is
already changed so we don't need to include sstables.hh anymore."
* 'progress-monitor' of https://github.com/glommer/scylla:
sstables: do not include sstables.hh from memtable glue
sstables: move write_monitor to its own header
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
This patch change the implementation of storage_service
repair_async_status to throw an exception, this way a 400 return code
will be returned.
Fixes#2794
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917080533.6612-1-amnon@scylladb.com>
- Transition the prepared statements caches for both CQL and Trhift to the cql3::prepared_statements_cache class.
- Add the corresponding metrics to the query_processor:
- Evictions count.
- Current entries count.
- Current memory footprint.
Fixes#2474
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This is a template class that implements caching of prepared statements for a given ID type:
- Each cache instance is given 1/256 of the total shard memory. If the new entry is going to overflow
this memory limit - the less recently used entries are going to be evicted so that the new entry could
be added.
- The memory consumption of a single prepared statement is defined by a cql3::prepared_cache_entry_size
functor class that returns a number of bytes for a given prepared statement (currently returns 10000
bytes for any statement).
- The cache entry is going to be evicted if not used for 60 minutes or more.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Ensure that the size of the cache is never bigger than the "max_size".
Before this patch the size of the cache could have been indefinitely bigger than
the requested value during the refresh time period which is clearly an undesirable
behaviour.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Sometimes we don't want the cached values to be periodically reloaded.
This patch adds the ability to control this using a ReloadEnabled template parameter.
In case the reloading is not needed the "loading" function is not given to the constructor
but rather to the get_ptr(key, loader) method (currently it's the only method that is used, we may add
the corresponding get(key, loader) method in the future when needed).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Current get(...) interface restricts the cache to work only with copy-constructable
values (it returns future<Tp>).
To make it able to work with non-copyable value we need to introduce an interface that would
return something like a reference to the cached value (like regular containers do).
We can't return future<Tp&> since the caller would have to ensure somehow that the underlying
value is still alive. The much more safe and easy-to-use way would be to return a shared_ptr-like
pointer to that value.
"Luckily" to us we value we actually store in a cache is already wrapped into the lw_shared_ptr
and we may simply return an object that impersonates itself as a smart_pointer<Tp> value while
it keeps a "reference" to an object stored in the cache.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Allow a variable entry size parameter.
Provide an EntrySize functor that would return a size for a
specific entry.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Get rid of the "proprietary" solution for asynchronous values on-demand loading.
Use utils::loading_shared_values instead.
We would still need to maintain intrusive set and list for efficient shrink and invalidate
operations but their entry is not going to contain the actual key and value anymore
but rather a loading_shared_values::entry_ptr which is essentially a shared pointer to a key-value
pair value.
In general, we added another level of dereferencing in order to get the key value but since
we use the bi::store_hash<true> in the hook and the bi::compare_hash<true> in the bi::unordered_set
this should not translate into an additional set lookup latency.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Since utils::loading_shared_values API is based on the original shared_index_list
this change is mostly a drop-in replacement of the corresponding parts.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This class implements an key-value container that is populated
using the provided asynchronous callback.
The value is loaded when there are active references to the value for the given key.
Container ensures that only one entry is loaded per key at any given time.
The returned value is a lw_shared_ptr to the actual value.
The value for a specific key is immediately evicted when there are no
more references to it.
The container is based on the boost::intrusive::unordered_set and is rehashed (grown) if needed
every time a new value is added (asynchronously loaded).
The container has a rehash() method that would grow or shrink the container as needed
in order to get the load factor into the [0.25, 0.75] range.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
There is no need to include the whole sstables.hh file in
memtable-sstable.hh anymore. All we need is the shared_sstable
definition and the progress monitor.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Soon I am about to introduce a read monitor, and pairing infrastructure
to manage it. Having it all living in sstables.hh force to include it
everytime, even in places that don't really need it.
Move to its own header.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.
We shouldn't call fast_forward_to() with the cache region locked.
Fixes#2791.
We're in one state at a time, so it's better to express it as a single
variable rather than N independent flags.
In preparation before adding more states.
Right now we pass a permit to the memtable writer and that permit is
used insite write_memtable_to_sstable to compose a write_monitor.
We would like to extend the write_monitor to include other things, that
right now are not available as parameters to write_memtable_to_sstable -
and which are possibly too specialized to be.
The solution for that is to pass the write_monitor instead of the permit
to the writer. Conceptually, that also makes sense because the
write_monitor is something the sstable writer is aware of. Permits, on
the other hand, are a database concept that is alien to the sstable
writer.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170915032836.21154-1-glauber@scylladb.com>
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:
storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d
The situation should heal after whole cluster is upgraded.
Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.
Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which cause all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.
The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."
* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
tests: cql_test_env: Enable all features in tests
schema_tables: Make make_scylla_tables_mutation() visible
migration_manager: Disable pulls during rolling upgrade from 1.7
storage_service: Introduce SCHEMA_TABLES_V3 feature
schema_tables: Don't alter tables which differ only in version
schema_mutations: Use mutation_opt instead of stdx::optional<mutation>
If there is a schema pull during rolling upgrade among a two 2.0 nodes,
then schema merge will delete the persisted schema version. When the node
loads that table again, e.g. on restart, it will generate a version
which is different than the one which 1.7 nodes use. This will
cause reads and writes to fail.
To avoid this, disable pulls until all nodes are upgraded.
Fixes#2802.
We apply deletion of scylla_tables.version to the incoming schema
mutations so that table schema version is recalculated after merge.
The mutations which we read from local schema tables may not have it
deleted in which case all tables would be considered as differing on
the presence of the version field. Avoid this by deleting the field
from old mutations as well.
After the service started, a state of the service may become
"failed", "active" or "activating".
But our script does not accept scylla-ami-setup.service become "failed" state,
in result the script shows up wrong message.
So we handle these three types of state correctly.
Fixes#2759
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1504589079-1986-1-git-send-email-syuu@scylladb.com>
evict() doesn't guarantee that the whole partition is discontinuous.
In particular, partition tombstone cannot be marked as discontinuous.
The parts which are still continuous must be updated.
Broken after c78047fa5b.
Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>
Collect coordinator side read statistic per CF and use them in percentile
speculative read executor. Getting percentile from estimated_histogram
object is rather expensive, so cache it and recalculate only once per
second (or if requested percentile changes).
Fixes#2757
Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
Currently overflow values are stored in incorrect bucket (last one
instead of special "overflow" one) and percentile() function throws
if there is overflow value. The patch fixes the code to store overflow
value in corespondent bucket and makes percentile() to take it into
account instead of throwing.
Message-Id: <20170911131752.27369-2-gleb@scylladb.com>
"This series fixes the problem of active reads causing OOM due to the fact that
partition snapshots they hold are not evictable. In particular, a single scan
of a partition larger than memory will bad_alloc due to itself.
After this, when partition entry is evicted from cache, data in all the snapshots
is also evicted. We still don't have row-level eviction, but this series lays some
grounds for it by making cache readers prepared for the possibility of rows
being evicted.
Fixes#2775.
Fixes #2730."
* tag 'tgrabiec/snapshot-evicition-in-cache-v1' of github.com:scylladb/seastar-dev:
tests: Add test for partition_entry::evict()
mutation_partition: Introduce range continuity checking methods
mutation_partition: Enable rows_entry::compare() on position_in_partition_views
tests: Extract mvcc tests to separate file
tests: row_cache: Add evicition tests
tests: simple_schema: Add new_tombstone() helper
tests: streamed_mutation_assertions: Introduce produces(mutation&)
streamed_mutation: Allow setting buffer capacity
row_cache: Evict partition snapshots
mvcc: Introduce partition_entry::evict()
row_cache: Handle eviction in partition reader
tests: row_cache_test: Don't assume mvcc snapshots are not evictable
row_cache: Reuse allocation_strategy::invalidate_references()
row_cache: Don't invalidate references on insertion
lsa: Move reclaim counter concept to allocation_strategy level
mvcc: Ensure partition_snapshot always destroys versions using proper allocator
mvcc: Encapsulate reference stability check in partition_snapshot
mvcc: Store LSA region reference in partition_snapshot
If snapshots are not evicted, they may pin unbouned amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.
Fixes#2775.
Fixes#2730.
The test was not updating the underlying mutation source but still
expecting to get the right data after calling invalidate(). If
snapshots are evictable, that's not guaranteed. Apply to underlying as
well, so data is read from underlying if necessary.
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.
Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
partition_snapshot is managed by lw_shared_ptr. Currently it is
assumed that before it dies, maybe_merge_versions() is called on it,
which destroyes it in the right allocator context. It's not very
safe. This patch improves safety by using the right allocator in
snapshot's destructor.
This fixes a failure in view_schema_test, which starts many instances
of single_node_cql_env. cancel_atomic_deletions() causes later
deletions to fail, which causes some of the test cases to fail.
Message-Id: <1505311250-3118-2-git-send-email-tgrabiec@scylladb.com>
Fixes a hang on shutdown with --smp 2 in perf_fast_forward. The hang
is in sstables::await_background_jobs_on_all_shards(), which is
waiting on sstable deletions. Not all shards agree to delete certain
sstables, because e.g. not all shards decide to compact them
yet. Cancel those deletes after database is stopped on all shards,
like we do in main.cc
Fixes#2796.
Message-Id: <1505292239-26032-1-git-send-email-tgrabiec@scylladb.com>
Finds live objects on seastar heap of current shard which contain
given value. Prints results in 'scylla ptr' format.
Example:
(gdb) scylla find 0x600005321900
thread 1, small (size <= 512), live (0x6000000f3800 +48)
thread 1, small (size <= 56), live (0x6000008a1230 +32)
Message-Id: <1505284614-19577-1-git-send-email-tgrabiec@scylladb.com>
compaction_strategy.hh throws an exception, but it doesn't add the
exception header. It is working in-tree because of inclusion order,
but it broke one of my yet-out-of-tree changes.
In any case, it is best to add the headers we will need to the files,
and that is what this patch does.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170912233326.26114-1-glauber@scylladb.com>
filter.cc has just two smallish functions, which are part of the sstable
class. Move them to sstables.cc where the rest of the class members are defined.
Message-Id: <20170912080541.7836-1-avi@scylladb.com>
The test used apply() variant which assumed that it was invoked in a
seastar thread, which is no longer the case after commit d22fdf4. Fix
by copying outisde cache update, and use non-deferring apply() variant
for cache update.
Message-Id: <1505200142-3650-1-git-send-email-tgrabiec@scylladb.com>
This patchset reduces includes of sstables.hh, reducing compile time
by both reducing the amount of code compiled, and the amount of
needless recompiles caused by false dependencies. It does so by
replacing lw_shared_ptr<sstable>, which requires a complete class,
with a new custom type shared_sstable, which allows an incomplete
sstable class definition.
* https://github.com/avikivity/scylla deps2/v2.1
database: change truncate() to flush while compaction is disabled
database: make run_with_compaction_disabled() a non-template
database: add indirection to compaction_manager instance
database: remove dependency on compaction.hh and compaction_manager.hh
size_estimates_virtual_reader.hh: add missing include
system_keyspace: add missing include
main: add missing include
storage_service: add missing include
repair: add missing include
compaction.hh: add missig include and forward declaration
compaction_manager: add missing include
shared_index_lists.hh: add missing include
perf_fast_forward: add missing include
sstable_mutation_test: add missing include
sstables: extract version and format enum into a separate header file
database.hh: add missing forward declaration for
foreign_sstable_open_info
cql_test_env: add forward declaration
database: make column_family::disable_sstable_write() out-of-line
sstables: introduce make_sstable() as a shortcut for
make_lw_shared<sstable>
treewide: use shared_sstable, make_sstable in place of
lw_shared_ptr<sstable>
sstables: use support for lw_shared_ptr with incomplete type for
shared_sstable
sstables: reduce dependencies
streaming: remove unneeded includes
When table is created, it doesn't contain any data, so we can mark the whole
data range as continuous in cache. This way reads will immediately hit, and
flushes will populate. If sstables are later attached, the attaching process
is supposed to invalidate affected ranges (and it does).
Fixes#2536.
Message-Id: <1505200269-4031-1-git-send-email-tgrabiec@scylladb.com>
Use the lw_shared_ptr deleter support to define shared_sstable without
pulling the definition of class sstable, reducing compile time and
dependencies if only shared_sstable is needed.
The parser generator somehow confuses the use-after-scope sanitizer, causing it
to use large amounts of stack space. Disable that sanitizer on that file.
Message-Id: <20170905110628.18047-1-avi@scylladb.com>
In preparation to make run_with_compaction_disabled() a non-template,
we want to remove any non-copyable captures (so the function can be
an std::function, which requires copyability). Move the flush within
the compaction disabled region. This changes the behavior, but it shouldn't
matter.
"These patches make Scylla refuse to load counter sstables that may
contain unsupported counter shards. They are recognised by the lack of
the Scylla component.
Fixes #2766."
* tag 'reject-non-scylla-counter-sstables/v1' of https://github.com/pdziepak/scylla:
db: reject non-Scylla counter sstables in flush_upload_dir
db: disallow loading non-Scylla counter sstables
sstable: add has_scylla_component()
A keyspace can be deleted while write is ongoing, so the object cannot
be used after defer point. The keyspace reference is only used to check
how many replies a write operation should wait for and this can be
precalculated during write handler creation.
Fixes#2777
Message-Id: <20170911084436.GG24167@scylladb.com>
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies the gossip node state. In #2404, the joining node
failed to bootstrap, because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in the gossip ack/ack2 message before applying the next one,
and we apply the state of the seed node earlier than non-seed node so we can
have the seed node's state faster. We also add some randomness to the order of
applying gossip node state to prevent some of the nodes' state are always
applied earlier than the others.
This series improves apply_state_locally for large cluster:
- Tune the order of applying endpoint_state
- Serialize apply_state_locally
- Avoid copying of the gossip state map
Fixes#2404"
* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
gossip: Avoid copying with apply_state_locally
gossip: Serialize apply_state_locally
gossip: Tune the order of applying endpoint_state in apply_state_locally
gossip: Introduce is_seed helper
gossip: Pass const endpoint_state& in notify_failure_detector
gossip: Pass reference in notify_failure_detector
Move the std::map<inet_address, endpoint_state> map from the gossip
ack/ack2 message directly and move it around in apply_state_locally to
avoid copying the map.
apply_state_locally will be called when gossip ack/ack2
message is received. It will use the std::map<inet_address,
endpoint_state>& map to update the endpoint state.
However, we can receive multiple such gossip ack/ack2 messages from
multiple peer nodes in parallel. Currently, we process them in parallel.
It is better to apply all the states from one node then move to apply
all the states from another node than interleaving. Because it is more
important to have the state of the whole cluster than to have a bit
newer state from another peer (if it is newer), especially when the node
boots up and runs its first round of gossip exchange.
After this patch, we apply the whole gossip state contained in the
gossip ack/ack2 message before applying the next one.
We currently always apply the endpoint_state in the order of the
endpoint ip address. This is not good because some of the endpoint's
state is always applied earlier than the others.
In large cluster, the number of endpoints can be large, it takes time to
apply all of them. To make it more fair, we apply the endpoint_state
randomly.
Apply the seed node's state earlier because in bootstrap, we will check
if we have seen the seed node in storage_service::bootstrap. In #2404,
the bootstrap failed because, the joining node hasn't apply the seed
node's state when storage_service::bootstrap runs.
* seastar 85ca12d...31b925d (19):
> net/byteorder: fix 64 bit ntohq and htonq on big endian machines
> core, util: fix compilation on non-x86 processors
> core/memory: Fix SIGSEGV in small_pool::add_more_objects()
> log: remove debug leftovers
> Merge "TLS state machine fixes" from Calle
> logger: allow adjusting the timestamp style for stdout logs
> thread: make thread_context::s_main portable
> core: add seastar::cache_line_size constant
> Add detach() to input_stream and output_stream
> Install dependencies for Arch Linux.
> tls: Guard non-established sockets in sesrefs + more explicit close + states
> tls: Make vec_push fully exception safe
> basic_sstring: resize uses sstring
> Merge "Add and correct unit tests" from Jesse
> tcp: enforce 1-byte maximum segment invariant with zero window
> tcp: verify 1-byte maximum segment invariant during send with zero window
> memory: reduce small_pool vulnerability to fragmentation further
> Prometheus: avoid merging all metrics family
> net: Fix possible NULL pointer dereference.
validate_request_failure() assumed that the future returned by execute_cql()
is always ready, which doesn't have to be the case, and caused aborts
in debug mode build.
Message-Id: <1504701342-13300-1-git-send-email-tgrabiec@scylladb.com>
Scylla already refuses to load counter sstables that do not have Scylla
component. However, if this happens because of 'nodetool refresh'
command the existing protection will trigger after sstables have been
moved to the data directory. This is too later, so an additional check
is added when the upload directory is scanned.
has_scylla_component() is going to be used to verify that an sstable has
been generated by a recent version of Scylla. This would make it
possible to reject sstables that may be unsafe to load (e.g. sstables
containing legacy counter shards).
"This series implements the missing API to terminate all repairs.
For example:
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/storage_service/force_terminate_repair"
With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.
On top of this, we can support termination of single repair job instead all
jobs.
Fixes#2105"
* tag 'asisas/repair_abort_v4' of github.com:scylladb/seastar-dev:
repair: Support termination of repair jobs
repair: Track repair_info
repair: Intorduce repair id to repair_info map
api: Add force_terminate_repair API
streaming: Add abort to stream_plan
streaming: Add abort_all_stream_sessions for stream_coordinator
streaming: Introduce streaming::abort()
streaming: Make stream_manager and coordinator message debug level
streaming: Check if _stream_result is valid
streaming: Log peer address in on_error
streaming: Introduce received_failed_complete_message
"Scylla 1.7.4 and older use incorrect ordering of counter shards, this
was fixed in 0d87f3dd7d ("utils::UUID:
operator< should behave as comparison of hex strings/bytes"). However,
that patch was not backported to 1.7 branch until very recently. This
means that versions 1.7.4 and older emit counter shards in an incorrect
order and expect them to be so. This is particularly bad when dealing
with imported correct sstables in which case some shards may become
duplicated.
The solution implemented in this patch is to allow any order of counter
shards and automaticly merge all duplicates. The code is written in a
way so that the correct ordering is expected in the fast path in order
not to excessively punish unaffected deployments.
A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available Scylla still writes sstables and sends on-wire counters using
the old ordering so that it can be correctly understood by 1.7.4, once
the flag becomes available Scylla switches to the correct order.
Fixes #2752."
* tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla:
tests/counter: verify counter_id ordering
counter: check that utils::UUID uses int64_t
mutation_partition_serializer: use old counter ordering if necessary
mutation_partition_view: do not expect counter shards to be sorted
sstables: write counter shards in the order expected by the cluster
tests/sstables: add storage_service_for_tests to counter write test
tests/sstables: add test for reading wrong-order counter cells
sstables: do not expect counter shards to be sorted
storage_service: introduce CORRECT_COUNTER_ORDER feature
tests/counter: test 1.7.4 compatible shard ordering
counters: add helper for retrieving shards in 1.7.4 order
tests/counter: add tests for 1.7.4 counter shard order
counters: add counter id comparator compatible with Scylla 1.7.4
tests/counter: verify order of counter shards
tests/counter: add test for sorting and deduplicating shards
counters: add function for sorting and deduplicating counter cells
counters: add counter_id::operator>
Until the cluster is fully upgraded from a version that uses the
incorrect counter shard ordering it is essential to keep using it lest
the old nodes corrupt the data upon receiving mutations with a counter
shard ordering they do not expect.
If the feature signaling that we have switched to the correct ordering
of counter shards is not enabled it means that the user still can do a
rollback to a version that expects wrong ordering. In order to avoid any
disasters when that happens write sstables using the 1.7.4 order until
we know for sure that it is no longer needed.
Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix
this problem a new feature is introduced that will be used to determine
when nodes with that bug fixed can start sending counter shard in the
correct order.
Due to a bug in an implementation of UUID less compare some Scylla
versions sort counter shards in an incorrect order. Moreover, when
dealing with imported correct data the inconsistencies in ordering
caused some counter shards to become duplicated.
* 'tgrabiec/make-row-cache-update-exception-safe' of github.com:scylladb/seastar-dev:
row_cache: Improve safety of cache updates
row_cache: Extract invalidate_sync()
memtable: Mark mark_flushed() as noexcept
database: Add non-throwing try_trigger_compaction()
database: Make add_sstable() have strong exception guarantees
row_cache: Don't require presence checker to be supplied externally
database: Supply presence checker in sstable snapshots
mutation_source: Introduce mutation_source::make_partition_presence_checker()
mutation_reader: Move definitions up in the header
mutation_reader: Use constructor delegation to reduce code duplication
row_cache: Make populate() preserve continuity
row_cache: Allow marking as fully continuous on construction
database: Add missing serialization of sstable set udpate and cache invalidation
Cache imposes requirements on how updates to the on-disk mutation source
are made:
1) each change to the on-disk muation source must be followed
by cache synchronization reflecting that change
2) The two must be serialized with other synchronizations
3) must have strong failure guarantees (atomicity)
Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.
Normally cache synchronization achieves no-failure thing by wiping the
cache (which is noexcept) in case failure is detect. There are some
setup steps hoever which cannot be skipped, e.g. taking a lock
followed by switching cache to use the new snapshot. That truly cannot
fail. The lock inside cache synchronizers is redundant, since the
user needs to take it anyway around the combined operation.
In order to make ensuring strong exception guarantees easier, and
making the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modiciation of the underlying mutation
source when invoked.
This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.
The failure semantics of update() and other synchronizers needed to
change due to strong exception guaratnees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.
The database::_cache_update_sem goes away, serialization is done
internally by the cache.
The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.
Fixes#2754.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
Every mutation source can have a presence checker. By default all
answer "maybe contains".
Having this on mutation_source level will be useful for simplifying
cache update flow. The cache can ask the right snapshot for a presence
checker rather than relying on database to know when and how to make
the right one which preserves all invariants.
This will be especially useful once all updates of the underlying
mutation source of cache (e.g. sstable list) will have to go through
cache for safety reasons.
Commit e3ad676433 missed a few places.
It is required to serialize sstable list update and cache synchronization
in order to preserve partition update isolation.
Fixes#2746.
* dist/ami/files/scylla-ami b41e5eb...5ffa449 (3):
> amzn-main.repo: stick to Amazon Linux 2017.03 kernel (4.9.x)
> Prevent dependency error on 'yum update'
> scylla_create_devices: don't raise error when no disks found
This was part of "add gate for generic async operations to column family" but
somehow didn't make it into the final patch.
Add the missing piece.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170830164205.4497-1-glauber@scylladb.com>
"optional interposer that will check integrity of writes to sstable components.
The option name is enable_sstable_data_integrity_check, it's disabled by default
and can be enabled via config file.
It will provide enough details that will help to find the root of the issue.
if disk failed for example, we would've something like the following reported:
ERROR 2017-08-17 09:18:11,577 [shard 0] sstable - integrity check failed for ./data/data/system_schema/aggregates-924c55872e3a345bb10c12f37c1ba895/system_schema-aggregates-ka-111-Scylla.db, stage: read after write verification, write: 4096 bytes to offset 0, reason: data read from underlying storage isn't the same as written, mismatch at byte 0:
data written sample: 10000001000000010000001a00000001
date read sample: 00000000000000000000000000000000"
* 'integrity_check_interposer_v3' of github.com:raphaelsc/scylla:
sstables: optionally check integrity of sstable component writes
sstables: remove unneeded new_sstable_component_file variant
db/config: add sstable_data_integrity_check option
sstables: introduce file interposer for integrity check
If config file's sstable_data_integrity_check option is enabled,
new integrity check interposer will be used in addition to the
existing one. Performance is expected to drop because of all
the integrity checks for every write. This new interposer will
provide detailed info when integrity check fails.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
optional interposer that will check integrity when writing to
sstable components. It will provide enough details that will
help to find the root of the issue, which may come from lower
level layers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This patch implements the missing API to terminate all repairs.
For example:
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/storage_service/force_terminate_repair"
With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.
Fixes#2105
The maps are stored in a vector. The vector has smp::count elements, each
element will be accessed by only one shard.
The add_repair_info, remove_repair_info and get_repair_info helpers
are added.
The api /storage_service/force_terminate is supposed to be
/storage_service/force_terminate_repair.
scylla-jmx uses /storage_service/force_terminate api.
So instead of renaming it, it is better to add a new name for it.
When we abort a session, it is possible that:
node 1 abort the session by user request
node 1 send the complete_message to node 2
node 2 abort the session upon receive of the complete_message
node 1 sends one more stream message to node 2 and the stream_manager
for the session can not be found.
It is fine for node 2 to not able to find the stream_manager, make the
log on node 2 less verbose to confuse user less.
It is the handler for the failed complete message. Add a flag to
remember if we received a such message from peer, if so, do not send
back the failed complete message back to the peer when running
close_session with failed status.
This reverts commit b56ba02335.
After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), streaming verb in rpc does not check if a
node is in gossip memebership since all the retry logic is removed.
Remove the extra wait before removing the joining node from gossip
membership.
Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>
Thread stack allocation may fail, in which case we did not do the
necessary invalidation. Fix by hoisting the scope of the cleanup function.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b.
Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>
With the "Use range_streamer everywhere" (7217b7ab36) seires, all
the user of streaming now do streaming with relative small ranges and
can retry streaming at higher level.
There are problems with timeout and retry at RPC verb level in streaming:
1) Timeout can be false negative.
2) We can not cancel the send operations which are already called. When
user aborts the streaming, the retry logic keeps running for a long
time.
This patch removes all the timeout and retry logic for streaming verbs.
After this, the timeout is the job of TCP, the retry is the job of the
upper layer.
Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>
- Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore).
- Removed curl command (no longer using the api_address), instead using scylla --version
- Added -v flag in iptables command, for more verbosity
- Added support to for OEL (Oracle Enterprise Linux) - minor fix
- Some text changes - minor
- OEL support indentation fix + collecting all files under /etc/scylla
- Added line seperation under cp output message
Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828131429.4212-1-tomer@scylladb.com>
Large deques require contiguous storage, which may not be available (or may
be expensive to obtain). Switch to new custom container instead, which allocates
less contiguous storage.
Allocation problems were observed with the summary and compression info. While
there is work to reduce compression info contiguous space use, this solves
all std::deque problems (and should not conflict with that work).
Fixes#2708
* tag '2708/v6' of https://github.com/avikivity/scylla:
sstables: switch std::deque to chunked_vector
tests: add test for chunked_vector
utils: add a new container type chunked_vector
"Fixes #2734."
* 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev:
tests: Introduce clustering_ranges_walker_test
tests: simple_schema: Add missing include
sstables: reader: Make clustering_ranges_walker work with empty range set
clustering_ranges_walker: Make adjacency more accurate
Such queries can be issued by counter updates which involve only
static row.
Causes failure in test_query_only_static_row invoked from
sstable_mutation_test. See commit 6572f38, which fixed the problem in
cache reader.
Fixes#2734.
Current check considered some adjacent range tombstones as overlapping
with the ranges. Making this more accurate will become more important
after we will rely on putting p_i_p::after_all_clustered_rows() in
_current_start in out-of-range state.
Previously _current_bucket_segment_index was used differently depending on
whether update_position_trackers() is used in a random or sequential
access. In the former case was used as the absolute index of the segment
(independent of the buckets) and in the latter as the relative index of
the segment within its bucket. This caused problems when there was a
switch between random and sequential access, meaning one could get different
results for an at() call depending on what was the previous at() call.
Fix this by consistently using _current_bucket_segment_index as - like its
name suggest - the bucket relative segment index.
Ref #1946.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7f68ac1d32c80e8dea6dfa11be02acaa961bce2a.1503924927.git.bdenes@scylladb.com>
* 'deps1/v1' of https://github.com/avikivity/scylla:
types.hh: extract marshal_exception from types.hh into a new file
utils: remove dependency on types.hh
locator: add missing include "log.hh"
supervisor: remove dependency on init.hh
tracing: add missing include "log.hh"
gms: remove unneeded #include "types.hh"
* 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Add more tests for fast forwarding across partitions
sstables: Fix abort in mutation reader for certain skip pattern
sstables: Fix reader returning partition past the query range in some cases
sstables: Introduce data_consume_context::eof()
The problem happens for the following sequence of events:
1) reader stops in the middle of some partition before it
skips to another partition range
2) reader is fast forwarded to a partition range which has no data in
the sstable. There are some partitions between the previous
partition range and the one we skip to
3) the reader is asked for next partition
The problem was that mutation_reader::fast_forward_to() was putting
the reader in _read_enabled == false state in step 2, but
data_consume_context was not fast forwarded to the range. When in step
3 we were asked for the next partition, we attempted to skip using
index (because of 1). The result of the skip was some position which
is outside of the current range of data_consume_context, which causes
it to abort. To fix, add a check for _read_enabled before we try to
skip.
If index was used to skip to the next partition (because the current
partition wasn't consumed in full) and reader's partition range ends
before the data file ends, we did not detect that we're out of range
before returning a streamed_mutation. Fix by checking _context.eof()
before doing that.
Refs #2733.
For better or worse, marshal_exception is used from utils/, and it's not good
to have utils/ depend on types.hh. Extract marshal_exception to make it possible
to remove the dependency.
* seastar 0083ee8...85ca12d (1):
> Merge "Run-time logging configuration" from Jesse
Includes patch from Jesse:
"Switch to Seastar for logging option handling
In addition to updating the abstraction layer for Seastar logging in `log.hh`,
the configuration system (`db/config.{hh,cc}`) has been updated in two ways:
- The string-map type for Boost.program_options is now defined in Seastar.
- A configuration value can be marked as `UsedFromSeastar`. This is like `Used`,
except the option is expected to be defined in the Boost.Program_options
description for Seastar. If the option is not defined in Seastar, or it is
defined with a different type, then a run-time exception is thrown early in
Scylla's initialization. This is necessary because logging options which are
now defined in Seastar were previously defined in Scylla and support for these
options in the YAML file cannot be dropped. In order to be able to verify that
options marked `UsedFromSeastar` are actually defined in Seastar, the
interface for adding options to `db::config` has changed from taking a
`boost::program_options::options_description_easy_init` (which is handle into
a `boost::program_options::options_description` which only allows adding
options) to taking a `boost::program_options::options_description`
directly (which also allows querying existing options).
Scylla also fully defers to Seastar's support for run-time logging
configuration."
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <ef26cffb91bef1ae95d508187a6dd861a6c4fc84.1503344007.git.jhaberku@scylladb.com>
"We have recently found two problems with the drop_column_family code
that needs addressing. The first is that exceptions in truncate() may
lead to stop() being skipped, which can cause Scylla to crash.
The other is that a truncate() issued before drop_column_family may get
the chance to execute only after the column family is already dropped
and also crash (That is issue 2726).
The second problem is the classic problem of asynchronous execution on
an object that may terminate, which we have been traditionally solving
with a gate. We add a gate to the column family that will be closed
during CF stop(), and we will require all asychronous operations to
enter it.
The immediate fix is for truncate(), where we have seen a real, concrete
problem. But it would be good to audit other code paths to make sure
that they are sane.
The most obvious ones, flush, compaction and sstable deletion are
already sane, since they are waited on explicitly during stop()."
Fixes#2726.
* 'issue-2726-v2-master' of github.com:glommer/scylla:
database: add gate for generic async operations to column family
database: make sure that column family is always stopped when dropped
We currently use std::deque<> for when we need large random-access containers,
but deque<> requires nr_items * sizeof(T) / 64 bytes of contiguous memory, which can
exceed our 256k fragmentation unit with large sstables. The new
container, which is a cross between deque and vector, has much lower
limitations.
Like deque, we allocate chunks of contiguous items, but they are
128k in size instead of 512. The last chunk can be smaller to avoid
allocating 128k for a really small vector.
run_with_compaction_disabled(), which is called by truncate, has a
pretty large defer point in remove(). When the code gets to finally
execute, we can't guarantee that the column family will still be alive.
That is true in particular if we issued a drop table command following
truncate: by the time truncate gets to resume, the CF will be gone.
Before the column family is dropped, it will always call its stop()
method, which means we have an opportunity to do some waiting there. We
already wait for flushes and current compactions to end.
Traditionally, we have been solving similar problems by adding a gate
that will catch asynchronous operations and making sure that potentially
asynchronous operations will enter the gate before executing. Let's do
the same thing here. We will close() the gate during stop().
Fixes#2726
Signed-off-by: Glauber Costa <glauber@scylladb.com>
truncate can throw exceptions. If it does, cf->stop() will never be
called because it is contained in a .then clause instead of finally.
One of the things that truncate does - in a finally block of its own -
is initiate a final compaction. If it returns an exception nobody will
wait for that compaction to finish (since cf->stop() is the one doing
that) and we'll crash.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"Preserve the networking configuration mode during the upgrade by generating the /etc/scylla.d/perftune.yaml
file and using it."
Fixes#2725.
* 'dist_respect_cpuset_conf-v3' of https://github.com/vladzcloudius/scylla:
scylla_prepare: respect the cpuset.conf when configuring the networking
scylla_cpuset_setup: rm perftune.yaml
scylla_cpuset_setup: add a missing "include" of scylla_lib.sh
Choose the networking configuration mode according to the current contents of /etc/scylla.d/cpuset.conf.
If it doesn't exist - use the default mode.
If it exists - use the mode that has been used for generation of the CPU set.
Store the configuration into the /etc/scylla.d/perftune.yaml
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
scylla_setup resets our configuration and perftune.yaml is a part of it.
perftune.yaml is generated based on the contents of cpuset.conf therefore we should reset
these together.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The scylla_cpuset_setup uses a verify_args() function that is defined in the scylla_lib.sh.
Fixes#2716
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* seastar e96881a...b9f1eb7 (9):
> httpd: indentation patch
> httpd: handle exception when shutting down
> stall-detector: Allow backtrace throttling to be configured
> stall-detector: Fix messages about suppresssion not appearing
> scripts: posix_net_conf.sh: allow passing a perftune.py configuration file as a parameter
> scripts: perftune.py: add the possibility to pass the parameters in a configuration file and print the YAML file with the current configuration
> scripts: perftune.py: actually use the number of Rx queues when comparing to the number of CPU threads
> core: make current_backtrace() noexcept
> memory: add large allocation detector stubs for default allocator
Fixes#2723.
* tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev:
repair: Do not allow repair until node is in NORMAL status
gossip: Add is_normal helper
The following backtrace was reported by user when running repair and keeping restarting the node at the same time.
#0 0x00007eff077281d7 in raise () from /lib64/libc.so.6
#1 0x00007eff07729a08 in abort () from /lib64/libc.so.6
#2 0x00007eff07721146 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007eff077211f2 in __assert_fail () from /lib64/libc.so.6
#4 0x00000000010ef2c2 in locator::token_metadata::first_token_index (this=0x641000214e98, start=...) at locator/token_metadata.cc:133
#5 0x00000000010ef2d9 in locator::token_metadata::first_token (this=0x641000214e98, start=...) at locator/token_metadata.cc:143
#6 0x00000000010e329d in locator::abstract_replication_strategy::get_natural_endpoints (this=0x641000494000, search_token=...)
at locator/abstract_replication_strategy.cc:66
#7 0x0000000001481186 in get_neighbors (hosts=std::vector of length 0, capacity 0, data_centers=std::vector of length 0, capacity 0,
range=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ksname=..., db=...) at repair/repair.cc:196
#8 repair_range<nonwrapping_range<dht::token> > (range=..., ri=...) at repair/repair.cc:781
#9 <lambda(auto:99&)>::<lambda(auto:100&&)>::<lambda(auto:101&)>::<lambda()>::operator() (__closure=0x7efec07f7460) at repair/repair.cc:1005
#10 futurize<future<bool_class<stop_iteration_tag> > >::apply<repair_ranges(repair_info)::<lambda(auto:99&)>::
It is reproduced with
1) while true; do curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/ks3"; done
2) start node 127.0.0.1, stop node 127.0.0.1 in a loop
The problem is, during boot up, the token_metadata is not replicated to all shards until
the node goes into NORMAL status.
To fix, check until node is in NORMAL status before allowing repair.
Fixes#2723
The number of keysapce and column family metrics reported is
proportional to the number of shards times the number of keysapce/column
families.
This can cause a performance issue both on the reporting system and on
the collecting system.
This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.
Fixes#2701
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
"Overly large metadata can hog memory which especially hurts in setups
with bad disk/memory ratio. To ease the pain compress the in-memory
compression-info.
The compression is implemented based on Avi's idea which is to group n
offsets together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size). Also offsets are allocated
only just enough bits to store their maximum value. The offsets are thus
packed in a buffer like so:
arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.
This of course means that stored offsets will not be aligned, not even
on a byte boundary, but the size reduction pretty convincing.
In addition, segments are stored in buckets, where each bucket has its
own base offset. In addition, segments in a buckets are optimized to
address as large of a chunk of the data as possible for a given chunk
size."
Ref #1946.
* 'bdenes/compress-compression-v3' of https://github.com/denesb/scylla:
Add unit test for compress::offsets
Optimise the storage of compression chunk offsets
Add script to precompute segmented compression parameters
To reduce the memory footprint of compression-info, n offsets are
grouped together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size). Also offsets are
allocated only just enough bits to store their maximum value. The
offsets are thus packed in a buffer like so:
arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.
The optimal value of n can be calculated for a given file_size (f) and
chunk_size (c), by finding the minima of the following function:
f(n) = (f/c)/n * (log2(f) + (n - 1)*log2((n-1)*(c + 64)))
This is done in an empirical way, using a script (see below).
Furthermore segments are stored in buckets, where each bucket has its
own base offset. Each bucket therefore can address an equal chunk of the
file and furthermore each segment in a bucket can address an equal
sub-chunk of this area.
The value of a given offset i is thus:
bucket_base_offset_for(i) + segment_base_offset_for(i) + offset(i)
To account for the bucketed storage we calculate a local_f, which is
optimized so that a bucketful of segmented offsets can address the
largest possible chunk of f. As value of this local_f only depends on
the bucket_size (b) and c the value of n can be made independent of f
and therefore only depend on one dynamic value, c. This makes life much
simpler as we don't need to know the size of the file up-front, we can
just append buckets to the storage on demand, while the required storage
is still less than a third [1] of the original storage requirements
(std::deque<uint64>).
The table with the minima(f(n)) for different f and c values is
pre-computed by gen_segmented_compress_params.py and
stored in sstables/segmented_compress_params.hh. This script also
creates a table with the best values of local_f for the given
bucket_size. At runtime we only select the best params based on c.
[1] This was calculated for c=4K and b=4K
Some (most?) users don't read logs or release notes, so they won't notice
that the ByteOrdered and Random partitioners were deprecated in 2.0. Make
them notice by refusing to start with a deprecated partitioner, unless a
switch is explicitly enabled.
Message-Id: <20170820073424.8331-1-avi@scylladb.com>
Limiting summary entry generation to at most one summary entry
per 64k of index data can lead to large index pages, with thousands
of index entries per summary entry. These are slow to parse, and there
is no real gain from the limit, since we already enforce a size
limit on the summary.
Remove the limit and allow summary entry generation based solely on
spanned data size.
Fixes#2711.
Message-Id: <20170819184255.14181-1-avi@scylladb.com>
split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The total of 400k can
cause defragmentation if memory is fragmented.
Fix by using a deque.
Fixes#2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
The script generates sstables/segmented_compress_params.hh which
contains a list with the optimal number of grouped offsets for
different data and chunk sizes as well as a list with the best
nominal data sizes for different chunk sizes, given a bucket size.
Data sizes are in the range of [2**4,2**50] and chunks in the
range of [2**4, 2**30]. Data sizes that are not used with the current
bucket_size are ommited.
See next commit for details of how the calculated values are used.
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.
Message-Id: <20170819135212.25230-1-avi@scylladb.com>
Two reasons for this change:
1) every compaction should be multiplexed to manager which in turn
will make decision when to schedule. improvements on it will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.
Fixes#2671.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned.
This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."
* 'calculate-counts/v3' of github.com:duarten/scylla:
query-result: Send row and partition count over the wire
query::result: Optimize calculate_counts()
These two admin related packages will be packaged under the "app-admin"
category and not the "dev-db" one.
This fixes the detection path of the packages for scylla_setup.
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>
Scylla's configure.py contains stuff we copied from Seastar's
configure.py, but is no longer used. Let's get rid of some of it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170813150842.12603-1-nyh@scylladb.com>
When user mistakenly forgot to pass parameter for a flag, our scripts misparses
next flag as the parameter.
ex) Correct usage is '--ntp-domain <domain> --setup-nic', but passed
'--ntp-domain --setup-nic'.
Result of that, next flag will ignore by scripts.
To prevent such behavior, reject any parameter that start with '--'.
Fixes#2609
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170815114751.6223-1-syuu@scylladb.com>
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.
This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>
incremental_reader_selector assumes the partition_range it receives has a lower
bound, but it was seen in mutation_test that this is not so.
Fix by checking whether the bound exists or not.
Message-Id: <20170815095852.14149-1-avi@scylladb.com>
"Exhausted readers belonging to a combined_mutation_reader can be fast
forwarded, so we have to keep them around. However, if the reader is
not fast forwardable, then we can drop the contained readers and their
buffers."
* 'ff-reader/v2' of github.com:duarten/scylla:
combined_mutation_reader: Drop exhausted readers if not in FF mode
combined_mutation_reader: Remove superfluous mutation_readers list
memtable_snapshot_source: Created readers should be fast forwardable
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar 7a49ae5...edb73ab (11):
> scripts: perftune.py: change the network module mode auto selection heuristic
> net/tls: explicitly ignore ready future during shutdown
> Use python2 explicitly as an interpreter for Python v2 scripts
> peering_sharded_service: prevent over-run the container
> Add link to documentation to the README.md
> Add guidelines for contributing to Seastar
> sharded: fix move constructor for peering_sharded_service services
> Provide a convenient way to lazy-convert to string the values of pointers
> tutorial: overhaul semaphores section
> simple-stream: Make fragmented::write_substream return simple if possible
> simple-stream: Make simple/fragmented memory output stream top level
quantity prevents index_reader from reading all index entries of a summary
entry that span more than min_index_interval entries. That can happen after
introduction of size-based sampling, and consequently, sstable will not be
able to return a key which logical position in summary entry is beyond
min_index_interval. It's ok to not use quantity because index_reader will
read all indexes until either next summary entry or end of file is reached.
Fixes test_sstable_conforms_to_mutation_source
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>
"Fixes #1842."
* 'size_based_sampling_v3' of github.com:raphaelsc/scylla:
tests: test summary entry spanning more keys than min interval
db/config: introduce sstable_summary_ratio option
sstables: introduce size-based sampling for sstable summary
sstables: make components_writer::offset const qualified and uint64_t
sstables: make writer::offset const qualified and uint64_t
"This patch series adds support for the `duration` type in CQL, which
was added to Cassandra in 3.10.
As part of this work, it was necessary also to add support for the
`vint` and `unsigned vint` types to the native protocol implementation,
which are part of v5 of the specification.
To test interactively, it is necessary to use cqlsh distributed with
Cassandra, as the version we distribute does not yet support the
duration type."
* 'jhk/duration_protocol/v5' of https://github.com/hakuch/scylla:
Support `duration` CQL native type
CQL native protocol: Add support for `vint` serialization
duration_test.cc: Add test for printing zero duration
duration.cc: Remove nop `const` qualifier on return type
Change `const` qualifier declaration order for `duration`
duration.cc: Simplify range checking
Rename `duration` to `cql_duration`
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).
Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
Currently, a summary entry is added after min_index_interval index
entries were written. Not taking into account size of index entries
becomes a problem with large partitions which may create big index
entries due to promoted indexes. Read performance is affected as a
consequence because index entries spanned by summary are all read
from disk to serve request.
What we wanna do is to also add a summary entry after index reaches
a boundary. To deal with oversampling, we want to write 1 byte to
summary for every 2000 bytes written to data file (this will be
eventually made into an option in the config file).
Both conditions must be met to avoid under or oversampling.
That way, the amount of data needed from index file to satify the
request is drastically reduced.
Fixes#1842.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
`duration` is a new native type that was introduced in Cassandra 3.10 [1].
Support for parsing and the internal representation of the type was added in
8fa47b74e8.
Important note: The version of cqlsh distributed with Scylla does not have
support for durations included (it was added to Cassandra in [2]). To test this
change, you can use cqlsh distributed with Cassandra.
Duration types are useful when working with time-series tables, because they can
be used to manipulate date-time values in relative terms.
Two interesting applications are:
- Aggregation by time intervals [3]:
`SELECT * FROM my_table GROUP BY floor(time, 3h)`
- Querying on changes in date-times:
`SELECT ... WHERE last_heartbeat_time < now() - 3h`
(Note: neither of these is currently supported, though columns with duration
values are.)
Internally, durations are represented as three signed counters: one for months,
for days, and for nanoseconds. Each of these counters is serialized using a
variable-length encoding which is described in version 5 of the CQL native
protocol specification.
The representation of a duration as three counters means that a semantic
ordering on durations doesn't exist: Is `1mo` greater than `1mo1d`? We cannot
know, because some months have more days than others. Durations can only have a
concrete absolute value when they are "attached" to absolute date-time
references. For example, `2015-04-31 at 12:00:00 + 1mo`.
That duration values are not comparable presents some difficulties for the
implementation, because most CQL types are. Like in Cassandra's implementation
[2], I adopted a similar strategy to the way restrictions on the `counter` type
are checked. A type "references" a duration if it is either a duration or it
contains a duration (like a `tuple<..., duration, ...>`, or a UDT with a
duration member).
The following restrictions apply on durations. Note that some of these contexts
are either experimental features (materialized views), or not currently
supported at run-time (though support exists in the parser and code, so it is
prudent to add the restrictions now):
- Durations cannot appear in any part of a primary key, either for tables or
materialized views.
- Durations cannot be directly used as the element type of a `set`, nor can they
be used as the key type of a `map`. Because internal ordering on durations is
based on a byte-level comparison, this property of Cassandra was intended to
help avoid user confusion around ordering of collection elements.
- Secondary indexes on durations are not supported.
- "Slice" relations (<=, <, >=, >) are not supported on durations with `WHERE`
restrictions (like `SELECT ... WHERE span <= 3d`). Multi-column restrictions
only work with clustering columns, which cannot be `duration` due to the
first rule.
- "Slice" relations are not supported on durations with query conditions (like
`UPDATE my_table ... IF span > 5us`).
Backwards incompatibility note:
As described in the documentation [4], duration literals take one of two
forms: either ISO 8601 formats (there are three), or a "standard" format. The ISO
8601 formats start with "P" (like "P5W"). Therefore, identifiers that have this
form are no longer supported.
Fixes#2240.
[1] https://issues.apache.org/jira/browse/CASSANDRA-11873
[2] bfd57d13b7
[3] https://issues.apache.org/jira/browse/CASSANDRA-11871
[4] http://cassandra.apache.org/doc/latest/cql/types.html#working-with-durations
Version 5 of the native protocol for CQL [1] adds the `vint` and `unsigned vint`
types.
An unsigned integer encoded as a `vint` has a variable size based on the
magnitude of the value. The first byte indicates the total number of bytes.
For signed integers, a "zig-zag" encoding scheme ensures that small negative
values are encoded as short-length `vint`s (0 -> 0, -1 -> 1, 1 -> 2, 2 -> 3, -2
-> 4, etc).
[1] https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec
"sstables will sometimes have narrow/disjont ranges (e.g. LCS L1+).
This can be exploited when reading from a range of sstables by opening
sstables on-demand thus saving memory, processing and potentially I/O.
To achieve this combined_mutation_reader is refactored such that the
reader selection logic is moved-out into a reader_selector class.
combined_mutation_reader now takes a reader_selector instance in its
constructor and asks it for new readers for the current ring position
on every call to operator()().
At the moment two specializations of reader_selector are provided:
* list_reader_selector which implements the current logic, that is using
a provided mutation_reader list, and
* incremental_reader_selector which implements the on-demand opening
logic discussed above.
Fixes#1935"
* 'bdenes/optimize_combined_reader-v6' of https://github.com/denesb/scylla:
Add combined_mutation_reader_test unit test
Remove range_sstable_reader
Add incremental_reader_selector
Add reader_selector to combined_mutation_reader
sstable_set::incremental_selector: select() now returns a selection
incremental_reader_selector is a specialization of reader_selector for
the case when sstables have narrow and/or disjoint token ranges. To
exploit this it creates new readers on-demand when their sstable's
token range intersects with the current ring position.
combined_mutation_reader now accepts as a constructor argument a
reader_selector instance whoose task is to create new readers on
each call to operator()() if needed and possible.
This way it is possible to control how readers are created through
different specializations of reader_selector.
The previous logic is refactored into list_reader_selector which
is using a pre-provided mutation_reader list and forwards all of them to
combined_mutation_reader at once.
We are moving to aptly to release .deb packages, that requires debian repository
structure changes.
After the change, we will share 'pool' directory between distributions.
However, our .deb package name on specific release is exactly same between
distributions, so we have file name confliction.
To avoid the problem, we need to append distribution name on package version.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502312935-22348-1-git-send-email-syuu@scylladb.com>
A seletion contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
"With this series, all the following cluster operations:
- bootstrap
- rebuild
- decommission
- removenode
will use the same code to do the streaming.
The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.
The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....
Later, we can introduce api for user to decide when to stop retrying and
the retry interval.
The benefits:
- All the cluster operation shares the same code to stream
- We can know the operation progress, e.g., we can know total number of
ranges need to be streamed and number of ranges finished in
bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
operation which usually takes long time to complete, e.g., when adding
a new node, currently if any of the existing node which streams data to
the new node had issue sending data to the new node, the whole bootstrap
process will fail. After this patch, we can fix the problematic node
and restart it, the joining node will retry streaming from the node
again.
- We can fail streaming early and timeout early and retry less because
all the operations use stream can survive failure of a single
stream_plan. It is not that important for now to have to make a single
stream_plan successful. Note, another user of streaming, repair, is now
using small stream_plan as well and can rerun the repair for the
failed ranges too.
This is one step closer to supporting the resumable add/remove node
opeartions."
* tag 'asias/use_range_streamer_everywhere_v4' of github.com:cloudius-systems/seastar-dev:
storage_service: Use the new range_streamer interface for removenode
storage_service: Use the new range_streamer interface for decommission
storage_service: Use the new range_streamer interface for rebuild
storage_service: Use the new range_streamer interface for bootstrap
dht: Extend range_streamer interface
We experienced 'Constructing RAID volume...' takes too much time on some AMIs,
this is because setup script stuck at 'yum -y install mdadm xfsprogs'.
We don't have to install these packages on AMI startup time, we should
preinstall them on AMI creating time.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502192796-21040-1-git-send-email-syuu@scylladb.com>
* Configuration for cluster_name is commented-out in config file.
* Default value set to empty string and if not rewritten by user then
warning is printed and value is reset to "ScyllaDB Cluster".
Fixes#2648.
Message-Id: <20170808113322.9313-1-daniel@scylladb.com>
ScyllaDB loves python & python loves ScyllaDB.
It would benefit the project to start enforcing some code guidelines
and basic QA with a linter along a PEP8 respect thanks to flake8.
This patch adds a tox config to at least start with an assessment
of the work to be done on all .py files in the code base.
To reduce its noise, tests on long lines (> 80char) are ignored
for now.
Signed-off-by: Ultrabug <ultrabug@gentoo.org>
Message-Id: <20170726134242.8927-1-ultrabug@gentoo.org>
index's file output stream uses write behind but it's not closed
when sstable write fails and that may lead to crash.
It happened before for data file (which is obviously easier to
reproduce for it) and was fixed by 0977f4fdf8.
Fixes#2673.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
After this patch and the following patches to use the new
range_streamder interface, all the following cluster operations:
- bootstrap
- rebuild
- decommission
- removenode
will use the same code to do the streaming.
The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.
The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....
Later, we can introduce api for user to decide when to stop retrying and
the retry interval.
The benefits:
- All the cluster operation shares the same code to stream
- We can know the operation progress, e.g., we can know total number of
ranges need to be streamed and number of ranges finished in
bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
operation which usually takes long time to complete, e.g., when adding
a new node, currently if any of the existing node which streams data to
the new node had issue sending data to the new node, the whole bootstrap
process will fail. After this patch, we can fix the problematic node
and restart it, the joining node will retry streaming from the node
again.
- We can fail streaming early and timeout early and retry less because
all the operations use stream can survive failure of a single
stream_plan. It is not that important for now to have to make a single
stream_plan successful. Note, another user of streaming, repair, is now
using small stream_plan as well and can rerun the repair for the
failed ranges too.
This is one step closer to supporting the resumable add/remove node
opeartions.
* seastar f14d2a3...7a49ae5 (8):
> sharded: improve support for cooperating sharded<> services
> sharded: support for peer services
> semaphore: add a version of with_semaphore that takes a duration timeout
> scripts: perftune.py: fix the CPU mask generation for more than 64 CPUs
> Revert "future-utils: make when_all() (vector variant) exception safe"
> Revert "future-utils: fix gross compilation errors in when_all()"
> future-utils: fix gross compilation errors in when_all()
> future-utils: make when_all() (vector variant) exception safe
Includes change to batchlog_manager constructor to adapt it to
seastar::sharded::start() change.
This reverts commit 98757069a5. We have the
failure detector which will detect an unresponsive node and fail the RPC.
Adding a timeout can just introduce false positives.
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.
This patch adds this metric so we don't fly blind.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
We have problem to run fstrim with nomerges=2, so we need to change
the parameter to 1 during fstrim execution.
To do this, this fix changes follow things:
- revert dropping scylla_fstrim on Ubuntu 16.04/CentOS
- disable distribution provided fstrim script
- enable scylla_fstrim on all distributions
- introduce --set-nomerges on scylla-blocktune
- scylla_fstrim call scylla-blocktune by following order:
- 'scylla-blocktune --set-nomerges 1'
- 'fstrim' for each devices
- 'scylla-blocktune --set-nomerges 2'
Fixes#2649
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501531393-21109-1-git-send-email-syuu@scylladb.com>
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.
Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.
Fixes#2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>
If we fail a streaming read due queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.
Fixes#2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>
"Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This series makes
it to try matching digests first like a single partition read."
Fixes#2666.
* 'gleb/digest_scan' of github.com:cloudius-systems/seastar-dev:
storage_proxy: make range_slice_read_executor go through digest matching state
storage_proxy: add capability to read data/digest for non singular ranges
storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This patch makes
it to try matching digests first like a single partition read.
The change requires internode protocol changes since currently it is not
possible to ask for multi partition data/digest over RPC. It means that
the capability has to be guarded by new gossip feature flag which the
patch also adds.
Using --hostname to give the container a meaningful name is a good
practice, and make the monitoring dashboard easier to understand
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803081027.6675-1-tzach@scylladb.com>
never_speculating_read_executor always waits for all targets so
block_for parameter is always equal to targets.size(). No need to
to pass it explicitly.
"This series adds an option to use paging in internal query and use that for the
get compaction history function.
Internal paging will be done explicitly, to use paging, you first create a
state object (that contains the query as well) and use that state to get the
first page, the result will contain both the query result and a new state that
can be used to get the next page.
Fixes#2366"
* 'amnon/paged_compaction_history_v5' of github.com:cloudius-systems/seastar-dev:
system_keyspace: Use paging for get compaction history
Add paging for internal queries
query_options: Allows creating query_options from query_options
"This series tries to fix possible repair stuck."
Fixes#2660, #2661, #2662.
* tag 'asias/repair_stuck_v2.1' of github.com:cloudius-systems/seastar-dev:
repair: Make send_repair_checksum_range timeout
repair: Singal parallelism_semaphore in case of error
repair: Fix repair_tracker done
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot up.
For instance, when the first node in the cluster, which listed both
itself and other node as seed in the yaml config, boots up, it will try
to talk to other seed nodes which are not started yet. The gossip shadow
round will be used to fetch the feature info of the cluster. Since there
is no other seed node in the cluster, the shadow round will fail. User
can reduce the default shadow_round_ms option to reduce the boot time.
Fixes#2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>
The timer is armed inside the section guarded by the _timer_reads_gate
therefore it has to be canceled after the gate is closed.
Otherwise we may end up with the armed timer after stop() method has
returned a ready future.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
"This series ensures the always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.
Fixes#2333"
* 'pi-cell-name/v1' of github.com:duarten/scylla:
tests/sstable_mutation_test: Test promoted index blocks are monotonic
sstables: Consider eoc when flushing pi block
sstables: Extract out converting bound_kind to eoc
It was meant to be run in the foreground since it is waited upon during
stop(), but as it is now from the stop() perspective it is completed
after first connection is accepted.
Fixes#2652
Message-Id: <20170801125558.GS20001@scylladb.com>
Queries with query page size equal or smaller than
zero are unpaged queries.
Count these kind of queries and make them a metrics
since they can ruin the performance of the system.
Message-Id: <20170731130004.25807-2-benoit@scylladb.com>
"This series reduce that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
stop the last bucket."
Fixes#2650.
* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
database: remove latency from the system table
estimated histogram: return a smaller histogram
Since the discovery of std::exchange(x, {}) move_and_clear has become
obsolete. Beside, the name was wrong, it did not clear the vector but
recreated it meaning that any allocated memory wasn't reused (not that
it mattered in the existing usages).
Message-Id: <20170731123549.10887-1-pdziepak@scylladb.com>
"These patches optimise combined_mutation_reader for cases where the majority
of mutation_readers is disjoint.
perf_fast_forward:
Results are medians of 3 of fragments/s as reported by perf_fast_forward.
Command:
perf_fast_forward -c1 --enable-cache
small: small-partition-skips (read=1, skip=0)
large: large-partition-skips (read=1, skip=0)
before after diff
small 195753 238196 +22%
large 1244325 1359096 +9%
perf_simple_query:
Results are medians of 10 of reads/s as reported by perf_simple_query.
Command:
perf_simple_query -c1
before 98651.40
after 104554.85
diff +6%"
* tag 'avoid-merge_mutations/v1' of https://github.com/pdziepak/scylla:
combined_mutation_reader: avoid unnecessary merge_mutations()
combined_mutation_reader: do not pop mutation with different key
"This series ensure that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.
To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.
We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."
* 'memtable-flush-additional-fixes/v4' of github.com:duarten/scylla:
column_family: Re-acquire flush permit in case of error
column_family: Don't hold sstable read lock when retrying flush
sstables: Release the flush permit before fsyncing
sstables: Introduce write_monitor
database: Extract out dirty_memory_manager
dirty_memory_manager: Refactor flush permit lifetime management
dirty_memory_manager: Invert permit acquisition order
memtable_list: Register different seal functions for each behaviour
Merging mutations is quite an expensive operation. The creation of
streamed mutation merger involves several allocations (mostly coming
from various std::vector) and then all mutation_fragments need to go
through a heap.
All this is completely unnecessary if there is only one mutation, so
let's skip a call to merge_mutations() in such cases. This also means
that we can reuse memory allocated by _current vector if merge is not
required.
Originally, the loop insidecombined_mutation_reader::next() so that it
was popping mutation from the heap and when it encountered one with a
different decorated key it was pushed back and the ones accumulated so
far merged and emitted. In other words, every time the reader progressed
to the next mutation it did needless pop and push operations on the
heap.
This patch rearranges the code so that the key of the next mutation is
compared before it is popped from the heap.
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.
This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
serialized_action_tests depends on the fact that first part of the
serialized_action is executed at cetrtain points (in which it reads a
global variable that is later updated by the main thread).
This worked well in the release mode before ready continuations were
inlined and run immediately, but not in the debug mode since inlining
was not happening and the main seastar::thread was missing some yield
points.
Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>
1. assert() is not constexpr.
2. can't use static_assert(), because the contructor may be called in a non-constexpr
environment; moved to log_histogram
3. pow2_rank() uses count_leading_zeros() which is not constexpr; split
into constexpr and non-constexpr versions
4. duplicated number_of_buckets() because bucket_of() can't be constexpr due to pow2_rank
Message-Id: <20170726105444.32698-1-avi@scylladb.com>
"This series ensure that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.
To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.
We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."
* 'memtable-flush-additional-fixes/v3' of github.com:duarten/scylla:
column_family: Re-acquire flush permit in case of error
column_family: Don't hold sstable read lock when retrying flush
sstables: Release the flush permit before fsyncing
sstables: Introduce write_monitor
database: Extract out dirty_memory_manager
dirty_memory_manager: Refactor flush permit lifetime management
dirty_memory_manager: Invert permit acquisition order
memtable_list: Register different seal functions for each behaviour
main: Don't catch polymorphic exceptions by value
* seastar a14d667...54e940f (8):
> Merge "Prometheus to use output stream" from Amnon
> http_test: Fix an http output stream test
> build: harden try_compile_and_link output temporary file
> configure: disable exception scalability hack on debug build
> build: don't perform test compiles to /dev/null
> Provide workaround for non scaleable c++ exception runtime
> Merge "Add output stream to http message reply" from Amnon
> configure.py: use user provided compiler flags when checking for features
Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse
updating package if current scylla < 1.7.3 && commitlog remains.
Note: We have the problem on scylla-server package, but to prevent
scylla-conf package upgrade, %pretrans should be define on scylla-conf.
Fixes#2551
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501187555-4629-1-git-send-email-syuu@scylladb.com>
"This patch series backports scalability fix for _Unwind_Find_FDE and modifies
out CentOS package to use our libgcc and libstdc++ which are needed to make
use of the fix instead of locally installed ones."
Ref #2646 (fixes on RHEL 7 and related only)
* 'gleb/exception-gcc-fix-v2' of github.com:cloudius-systems/seastar-dev:
dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one
dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.
This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If schema merging completes at lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.
In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:
concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test
Allowing only one active and one queued pull request per remote
endpoint is enough.
* tag 'tgrabiec/dont-accumulate-schema-pulls-v2' of github.com:scylladb/seastar-dev:
migration_manager: Log schema pulls
migration_manager: Prevent pull requests from accumulating
utils: Introduce serialized_action
If schema merging completes at lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.
In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:
concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test
Allowing only one active and one queued pull request per remote
endpoint is enough.
When flushing a promoted index block using a range tombstone cell name
as a bound, use the right eoc value instead of always writing
composite::eoc::none.
Fixes#2333
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
consume_mutation_fragments_until() allows consuming mutation fragments
until a specified condition happens. This patch reorganises its
implementation so that we avoid situations when fill_buffer() is called
with stop condition being true.
Message-Id: <20170727122218.7703-1-pdziepak@scylladb.com>
When we install scylla metapackage with version (ex: scylla-1.7.1),
it just always install newest scylla-server/-jmx/-tools on the repo,
instead of installing specified version of packages.
To install same version packages with the metapackage, limited dependencies to
current package version.
Fixes#2642
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170726193321.7399-1-syuu@scylladb.com>
Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse
updating package if current scylla < 1.7.3 && commitlog remains.
Note: We have the problem on scylla-server package, but to prevent
scylla-conf package upgrade, %pretrans should be define on scylla-conf.
Fixes#2551
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170727110730.613-1-syuu@scylladb.com>
database::make_sstable_reader() creates a reader which will need to
obtain a semaphore permit when invoked. Therefore, each read may
create at most one such reader in order to be guaranteed to make
progress. If the reader tries to create another reader, that may
deadlock (or for non-system tables, timeout), if enough number of such
readers tries to do the same thing at the same time.
Avoid the problem by dropping previous reader before creating a new
one.
Refs #2644.
Message-Id: <1501152454-4866-1-git-send-email-tgrabiec@scylladb.com>
Initialize the system_auth and system_traces keyspaces and their tables after
the Node joins the token ring because as a part of system_auth initialization
there are going to be issues SELECT and possible INSERT CQL statements.
This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.
Fixes#2273
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
This patch remove the latency histograms from the system table, it also
extend the already existing exclusion to all system keyspaces.
It also uses the new get_histogram API to set a minimal bucket size to
100 microseconds.
The current histogram contains 91 buckets, this is a very high
resolution with a high upper limit.
To reduce traffic passed, between scylla and the prometheus, this patch
generate a smaller histogram.
It limit the number of buckets (16 by default), set a lower limit to the
lowest bucket, and uses 2 as the bucket coeficient.
Highest empty buckets will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
estimated histogram
These patches contain some minor fixes for performance regression reported
by perf_fast_forward after partial cache was merged. The solution is still
far from perfect, there is one case that still has 30% degradation, but
there is some improvement so there is no reason to hold these changes back.
Refs #2582.
Some numbers:
before - before cache changes were merged
(555621b537)
cache - at the commit that introduced the partial cache
(9b21a9bfb6)
after - recent master + this series
(based on e988121dbb)
Differences are shown relative to "before".
Testing effectiveness of caching of large partition, single-key slicing reads:
Large partitions, range [0, 500000], populating cache
before cache after
1636840 1013688 1234606
-38% -25%
Large partitions, range [0, 500000], reading from cache
before cache after
2012615 3076812 3035423
+53% +51%
Testing scanning small partitions with skips.
reading small partitions (skip 0)
before cache after
227060 165261 200639
-27% -11%
skipping small partitions (skip 1)
before cache after
29813 27312 38210
-8% +28%
Testing slicing small partitions:
slicing small partitions (offset 0, read 4096)
before cache after
195282 149695 180497
-23% -8%
* https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3:
sstables: make sure that fill_buffer() actually fills buffer
mutation_merger: improve handling of non-deferring fill_buffer()s
partition_snapshot_row_cursor: avoid apply() in single-version cases
sstables: introduce decorated_key_view
ring_position_comparator: accept sstables::decorated_key_view
sstable: keep a pre-computed token in summary_entry
sstables: cache token in index entries
index_reader: advance_and_check_if_present() use index_comparator
ring_position_comparator: drop unused overloads
cache_streamed_mutation: avoid moving clustering_row
streamed_mutation: introduce consume_mutation_fragments_until()
cache_streamed_mutation: use consumer based read_context reader
rows_entry: make position() inlineable
mutation_fragment: make destructor always_inline
keys: introduce compound_wrapper::from_exploded_view()
sstables: avoid copying key components
compound_compat: explode: reserve some elements in a vector
cache: short-circut static row logic if there are no static columns
cache: use equality comparators instead of tri_compare
sstables: avoid indirect calls to abstract_type::is_multi_cell()
loading_cache invokes a timer that may issue asynchronous operations
(queries) that would end with writing into the internal fields.
We have to ensure that these operations are over before we can destroy
the loading_cache object.
Fixes#2624
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501096256-10949-1-git-send-email-vladz@scylladb.com>
We need to consider the _live_endpoints size. The nr_live_nodes should
not be larger than _live_endpoints size, otherwise the loop to collect
the live node can run forever.
It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).
Fixes#2637
Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
Equality comparator may be much cheaper than the fully fledged
trichotomic comparator, especially if the component types are byte order
equal but not byte order comparable.
When we are exploding a compound key we know already that there is more
than one component, but we have no easy way of determining how many of
them are going to be there. Let's reserve space for a few elements so
that we avoid an excessive number of reallocations in case of
medium-sized keys.
mutation_fragment destructor was already made inline-friendly by moving
most of the logic to a separate function. However, the compiler still is
quite reluctant to inline it in certain cases, so let's give it a
stronger hint.
consume_mutation_fragments_until() is a consumer based interface that
avoids indirect calls and continuation overhead present in the naive
streamed_mutation::operator() approach.
clustering_row can stores quite a lot of data internally which makes its
move constructor not exactly cheap.
If possible it is better to move mutation_fragment around as it keeps
everything externally. This also avoids some cases when clustering row
would be extracted from mutation_fragment only to be made to create
another mutation_fragment later.
When a sstable reader is fast forwarded some index entries may be read
(and compared) multiple times. This patch makes sure that once a token
is computed we keep it around and reuse if the entry is accessed again.
Each sstable index lookup involves a binary search in the summary and
each time a partition key of summary entry is compared with anything its
token needs to be calculated.
Since we keep summary in the memory all the time it is better to also
keep the tokens around.
ring_position_comparator has overloads for comparing ring_positions as
well as sstables::key_view. In the case of the latter it needs to
compute the token of the key. However, the sstable layer could cache
some tokens so let's allow the comparator callers to provide it
directly.
It is possible that a call to fill_buffer() will return an immediately
ready future. This patch avoids uncontrolled recursion in case when all
merged streamed mutation do not defer ini fill_buffer() and also
optimises for non-deferring case by avoiding some of the logic.
streamed_mutation::impl::fill_buffer() is supposed to either push
mutation fragments to the buffer or set EOS flag. However, it was
possible that mp_row_consumer would return proceed::no if a skip was
needed without satisfying any of these conditions.
"This patch series addresses some feedback from the preliminary
HACKING.md, adds some new content, and updates the README file with
some quick-start information."
* 'jhk/better_hacking/v3' of github.com:hakuch/scylla:
README.md: Add quick-start section and defer to `HACKING.md`
HACKING.md: `CMakeLists.txt` for analysis works for other IDEs too
HACKING.md: Add details and examples for unit tests
HACKING.md: Add section for project dependencies
HACKING.md: Describe releases and tags
HACKING.md: Re-work "building" section, including memory needs
HACKING.md: Update ccache recommendations
HACKING.md: Update "Contributing" URL
If a node is notified of a schema change where the schema's dropped
columns have changes, that node will miss the changes to the dropped
columns. A scenario where this can happen is where a column c is
dropped, then added as a different typed, and then dropped again, with
a node n having seen the first drop and being notified of the
subsequent add and drop.
Fixes#2616
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-1-duarte@scylladb.com>
Counters write path on leader is completely different than on any other
replica (non-leaders share write path between counters and regular
columns). This patch makes sure that counter writes performed on leader
are added to appropriate metrics.
Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>
* git@github.com:raphaelsc/scylla.git row_cache_fixes:
db: atomically synchronize cache with changes to the snapshot
db: refresh row cache's underlying data source after compaction
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>
cache_streamed_mutation assumed that at least one clustering range was
specified. That was wrong since the readers are allowed to query just
for a static row (e.g. counter update that modifies only static
columns).
Fixes#2604.
Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>
To support both version and patch release, the version server now returns
a patchversion parameter that include the latest minor version's patch
release.
The housekeeping should return a separate message if the current
minor version is not with the latest patch release, and a message if the version was
changed.
For example, if a user is using version 1.6.1 it should get a warning
that he need to update if 1.6.2 is available and in addition a warning it
should upgrade if version 1.7 is out.
Examples:
$ scylla-housekeeping version --version 1.6.2
Your current Scylla release is 1.6.2, while the latest patch release is 1.6.4, and the latest minor release is 1.7.2 (recommended)
$ scylla-housekeeping version --version 1.7.1
You current Scylla release is 1.7.1 while the latest patch release is 1.7.2 is available, update for the latest bug fixes
$ scylla-housekeeping version --version 1.7.1
You current Scylla release is 1.7.1 while the latest patch release is 1.7.2, update for the latest bug fixes and improvements
Fixes#1972
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170725095455.6450-1-amnon@scylladb.com>
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remains opened, wasting
disk space.
The fix is to refresh underlying data source in row cache.
Fixes#2570.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
updates to cache and snapshot (i.e. sstable set) aren't synchronized, so
it may happen that cache update for memtable flush will use wrong snapshot
version, and that violates cache invariant of each partition entry only
reflecting one snapshot.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (previous and latter versions do
support it). Use old-style iteration instead.
Message-Id: <20170724080128.8824-1-avi@scylladb.com>
The 'free' attribute is not updated for all pages belonging to a large
object, so we can't use it to determine if the page is allocated or
not. More reliable way is to check if it belongs to any free span.
Message-Id: <1500648094-20039-1-git-send-email-tgrabiec@scylladb.com>
This is in order to avoid frequent misses which have a relatively high
cost. A miss means we need to fetch schema definition from another
node and in case of writes do a schema merge.
If the schema is kept alive only by the incoming request, then it
will be forgotten immediately when the request is done, and the next
request using the same schema version will miss again.
Refs #2608.
Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>
"Fixes issues uncovered in longevity test (#2608).
Main problem is that due to time drift scylla_tables.version column
may not get deleted on all nodes doing the schema merge, which will
make some nodes come up with different table schema version than others.
The inconsistency will not heal because scylla_tables doesn't
take part in the schema sync. This is fixed by the last patch.
This will cause nodes to constantly try to sync the schema, which under
some conditions triggers #2617."
* tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev:
schema_tables: Add scylla_tables to ALL
schema: Make schema_mutations equality consistent with digest
schema_tables: Extract compact_for_schema_digest()
schema_tables: Always drop scylla_tables::version
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.
The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.
Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>
there could be a lot of compactions when querying for compaction
history.
This patch changes the query to use paging. It would collect all results
when returning to the caller.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Usually, internal queries are used for short queries. Sometimes though,
like in the case of get compaction history, there could be a large
amount of results. Without paging it will overload the system.
This patch adds the ability to use paging internally.
Using paging will be done explicitely, all the relevant information
would be store in an internal_query_state, that would hold both the
paging state but also the query so consecutive calls can be made.
To use paging use the query method with a function.
The function gets beside a statement and its parameters a function that
will be used for each of the returned rows.
For example if qp is a query_processor:
qp.query("SELECT * from system.compaction_history", [] (const cql3::untyped_result_set::row& row) {
....
// do something with row
...
return stop_iteration::no; // keep on reading
});
Will run the function on each of the compaction history table rows.
To stop the iteration, the function can return stop_iteration::yes.
* https://github.com/pdziepak/scylla.git perf_fast_forward_improvements/v1:
perf_fast_forward: move global state to global scope
perf_fast_forward: move tests groups to separate functions
perf_fast_forward: allow running only selected test groups
perf_fast_forward: use consumer interface for reading
streamed_mutation
So that scylla_tables takes part in the digest and in mutations sent
as part of schema sync. Otherwise inconsistencies in scylla_tables
will not heal.
Refs #2608.
Digest only looks like live values, ignoring deletion
information. Equality should be consistent with that, so that schemas
considered equal do not trigger the alter path unnecessarily.
It can happen that due to time drift between nodes, the incoming
"version" cell will have higher timestamp than api::new_timestamp().
In such case the column would not be dropped and would cause version
mismatch between nodes.
Ensure it's always covered by using max of current time and cell's
timestamp.
Refs #2608.
By default build_deb.sh destroys all previous build image to make sure we don't
have environment dependent issue, but it's takes time to build distribution root
image (.tgz in pbuilder) from scratch.
--no-clean option is for skipping create .tgz stage, use previously built image,
to make build time shorter.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500542094-12946-1-git-send-email-syuu@scylladb.com>
Using streamed_mutation::operator() is undesirable as it introduces an
indirect call and a continuation overhead for each emitted mutation
fragment. Consumer interface is the preferred method of reading streamed
mutations.
All test perf_fast_forward test cases currently live in the main
function. This patch moves the state they rely on to a global scope
so that it will be easier to extract these tests to individual
functions.
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.
The said maximum is recommended to be set 50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.
The second patch in this series provides a preview on how such auto-tune
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.
Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."
* 'memtable-controller-v5' of https://github.com/glommer/scylla:
simple controller for memtable/streaming writer shares.
restrict background writers to 50 % of CPU.
Some places remained where code looked directly at
system_keyspace::NAME to determine iff a ks is
considered special/system/protected. Including
schema digest calculation.
Export "is_system_keyspace" and use accordingly.
Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."
* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
schema: Revert back to the 1.7 layout of static compact tables in memory
schema: Use v3 column layout when converting to/from schema mutations
schema: Encapsulate column layout translations in the v3_columns class
"This series improves the streaming error handling so that when one side of the
streaming failed, it will propagate the error to the other side and the peer
will close the failed session accordingly. This removes the unnecessary wait and
timeout time for the peer to discover the failed session and fail eventually.
Fix it by:
- Use the complete message to notify peer node local session is failed
- Listen on shutdown gossip callback so that we can detect the peer is shutdown
can close the session with the peer
Fixes#1743"
* tag 'asias/streaming/error_handling_v2' of github.com:cloudius-systems/seastar-dev:
streaming: Listen on shutdown gossip callback
gms: Add is_shutdown helper for endpoint_state class
streaming: Send complete message with failed flag when session is failed
streaming: Handle failed flag in complete message
streaming: Do not fail the session when failed to send complete message
streaming: Introduce send_failed_complete_message
streaming: Do not send complete message when session is successful
streaming: Introduce the failed parameter for complete message
streaming: Remove unused session_failed function
streaming: Less verbose in logging
streaming: Better stats
Due to the asynchronous nature of view update propagation, results
might still be absent from views when we query them. To be able to
deterministically assert on view rows, this patch retries a query a
bounded number of times until it succeeds.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718212646.2958-1-duarte@scylladb.com>
We are using C* 3.x compatible layout in schema tables but want to
keep using the 1.7 layout in memory for compatibility during rolling
upgrade. This patch switches the schema and schema_builder classes
back to the old layout. Translation of layout happens when converting
to/from schema mutations.
Notable changes:
1) Includes a revert of commit 6260f31e08
"thrift: Update CQL mapping of static CFs".
2) Brings back the "default_validation_class" schema attribute. In v3
it can be dervied from column definitions, but in v2 it can't, so
we have to store it.
3) legacy_schema_migrator and schema_builder don't have to do
conversions to v3, this is now handled by the v3_columns
class. schema_builder works with the same layout as schema, that
is v2.
4) Includes a revert of commit 66991a7ccb
"v3 schema test fixes"
Fixes#2555.
"Time window strategy was introduced to address several limitations of
date tiered strategy. In addition, its options are much easier to reason
about, basically just window size and window unit.
TWCS will work to keep only one sstable in each window. So the only real
optimization needed is to align partition key to the window.
Size tiered strategy is used to reduce write amplification when compacting
the incoming window.
For more details: https://issues.apache.org/jira/browse/CASSANDRA-9666
Fixes #1432."
* 'twcs_v2' of github.com:raphaelsc/scylla:
tests: add tests for time window compaction strategy
compaction: wire up time window compaction strategy
compaction/twcs: override default values with options in schema
sstables: implement time window compaction strategy
sstables: import TimeWindowCompactionStrategy.java
This patch provides a rather trivial implementation of hash() for
collection types.
It is needed for view building, where we hold mutations in a map
indexed by partition keys (and frozen collection types can be part of
the key).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718192107.13746-1-duarte@scylladb.com>
This patch introduces a simple controller that will adjust memtables CPU
shares, trying to keep it around the soft limit: if we start going below
it means we're too fast (unless we are idle) and shares are adjusted
downwards. If we start going above it means we're too fast and shares
are adjusted upwards.
I have tested this extensively in a single-CPU setup with various
CPU-bound workloads while tracking virtual dirty and the results are
good, with virtual dirty fluctuating only slightly, somewhere within the
desired range.
Exceptions to this are:
1) when the load is very light - the idle system goes faster, and that's
ok
2) when the load is very high - as foreground requests dominate we can't
flush fast enough and hit the hard limit. However, in such scenarios
the memtable shares do hit its maximum, and the results are no worse
than they are right now and this will only be fixed by CPU-limiting the
actual requests.
This feature can be disabled with a config option - that is scheduled to
go away as we acquire more confidence in this. When the feature is
disabled, all background writers (streaming, compaction, memtables) will
share the same scheduling group, with static quotas.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background process, which are less so.
The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.
In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances as this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.
The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.
While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, leaving background processes run wild is not
adaptative either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitining in the mean time.
As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When a node shutdown itself, it will send a shutdown status to peer
nodes. When peer nodes receives the shtudown status update, they are
supposed to close all the sessions with that node becasue the node is
shutdown, no need to wait and timeout, then fail the session.
This change can speed up the closing of sessions.
Currently, send_complete_message is not used. We will use it shortly in
case the local session is failed. Send a complete message with failed
flag to notify peer node that the session is failed so that peer can
close the session. This can speed up the closing of failed session.
Also rename it to send_failed_complete_message.
it will be later converted to C++. Imported from latest scylla-
tools-java repository. Checked that it doesn't lack anything.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
A user reported scylla-server.service does not able to run on their cloud instance, because of hugeadm.
(hugeadm says the kernel does not support huge pages.)
We don't need it for posix mode, so move it in dpdk mode.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500367219-8728-1-git-send-email-syuu@scylladb.com>
Instead of retrying, just drop mutations that raced with a truncate.
* git@github.com:duarten/scylla.git truncate-reorder/v1:
database: Rename replay_position_reordered_exception
database: Drop mutations that raced with truncate
The complete_message is not needed and the handler of this rpc message
does nothing but returns a ready future. The patch to remove it did not
make into the Scylla 1.0 release so it was left there.
Use this flag to notify the peer that the session is failed so that the
peer can close the failed session more quickly.
The flag is used as a rpc::optional so it is compatible use old
version of the verb.
It is useful for larger cluster with larger gossip message latency. By
default the fd_max_interval_ms is 2 seconds which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger cluster, the gossip message udpate
interval can be larger than 2 seconds.
Fixes#2603.
Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev:
schema: Use proper name comparator
legacy_schema_migrator: Properly migrate non-UTF8 named columns
schema_tables: Store column_name in text form
legacy_schema_migrator: Migrate columns like Cassandra
schema_builder: Add factory method for default_names
legacy_schema_migrator: Simplify logic
thrift: Don't set regular_column_name_type
schema: Use proper column name type for static columns
schema: Fix column_name_type() for static compact tables
schema: Introduce clustering_column_at()
thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type
partition_slice_builder: Use proper column's type instead of regular_column_name_type()
This replaces column_definition::name_comparator, which incorrectly
assumes that names are always utf8, with name_compare moved from
schema::rebuild() and unifies usages.
Currently migrator assumed all columns are utf8-named, which
doesn't have to be the case for static compact tables.
Refs #2597.
Due to #2573, we can assume that Scylla wasn't used with non-utf8
column names, and that old names are always in textual form.
This fixes generation of synthetic columns for static compact tables.
Current code always generates synthetic clustering column with utf8
type and synthetic regular column with bytes type (in schema_builder).
That's fine when creating a new CQL table, but not when migrating
existing tables created via thrift API.
Fixes#2584.
This also migrates empty compact value columns like Cassandra
does. Such columns are present in compact tables without regular
columns, e.g.:
create table test (k int, ck int, primary key (k, ck)) with compact storage;
They should be migrated to a synthetic regular column with
empty_type type and a non-empty name.
The expression "is_dense.value_or(true)" is always true inside the if,
so drop it.
This allows us to drop temporary calulated_is_dense.
We can also get rid of one of the if branches by extracting
builder.set_is_dense() outside.
After f5dae826ce, static columns not
always have utf8 column names. For static compact tables it's
determined by the cell name comparator type, which is equal to the
type of the synthetic clustering column.
Caused various errors with static thrift tables with non-utf8
comparator.
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.
Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>
Rename replay_position_reordered_exception to
mutation_reordered_with_truncate_exception for more precision, since
this is the only situation where this exception can be thrown.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We don't ensure mutations are applied in memory following the order of their
replay positions. A memtable can thus be flushed with replay position rp,
with the new one being at replay position rp', where rp' < rp. This breaks
an intrinsic assumption in the code, which this series addresses.
Fixes#2074
branch memtable-flush/v3 of git@github.com:duarten/scylla.git:
commitlog: Always flush latest memtable
column_family: More precise count of switched memtables
column_family: Fix typo in pending_tasks metric name
column_family: More precise count of pending flushes
dirty_memory_manager: Remove unnecessary check from flush_one()
column_family: Don't rely on flush_queue to guarantee flushes finished
column_family: Don't bother closing the flush_queue on stop()
column_family: Stop using flush_queue
column_family: Remove outdated comment about the flush_queue
memtable: Stop tracking the highest flushed rp
Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}.
No doubt it is correct, but sheesh.
Make the data_value explicit to humor it.
Message-Id: <20170713074315.9857-1-avi@scylladb.com>
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.
There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.
This patch fixes this and enforces the expected order.
Fixes#2531Fixes#2593
Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
Since we no longer enforce that mutations are applied in memory
ordered by their replay_positions, the way the highest_flush_rp is
being tracked is no longer correct.
The invariant it was used to maintain no longer exists, so we can get
rid of it together with the assertion on the highest_flush_rp on
flush().
Fixes#2074
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since commitlog ordering requirements have been relaxed, we now keep
the set of replay_positions seen by a memtable in a set, which we then
use to clean up relevant segments in the commitlog. This means that
the guarantees provided by the flush_queue are no longer necessary.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When stopping a column family we issue a flush(), for which we wait.
Since writes are supposed to have stopped coming in, and also new
flush requests, there's no need to call and wait for the flush_queue
to be closed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. So, use a phased_barrier() to
ensure that calling flush() returns a future that completes when all
flushes up to that point have finished.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We don't need to check whether a memtable is empty in flush_one(), as
that must be checked later, during the actual sealing.
The condition itself is rare and is checked already after the potentially
contented semaphore has been acquired.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we update the count of pending flushes in the same
place as we update the stats across column families, which is more
correct since it only accounts for actual flushes and not those of
empty memtables or that have been coalesced together.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The memtable_switch_count metric is supposed to count the number of
times a flush has resulted in the memtable being switched out, but we
were incrementing the count regardless of whether we tried to flush an
empty memtable or two or more flushes were coalesced into one. This
patch fixes this by moving the metric to where the memtable is
actually switched.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. When flushing commit log segments,
ensure we flush the latest memtable.
Refs #2074
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series aims to fix the "serving invalid (old) values" issue in the
loading_cache (issue #2590) by arming the timer with a period that equals
min(expire, refresh).
We are still trying to optimize the main case where 'expire' is
significantly longer than 'refresh' period.
We don't want to add any additional logic in the fast path and this
series gives the immediate solution for the issue above while not adding
any additional CPU cycle to the fast path."
* 'loading_cache_short_expired-v2' of https://github.com/vladzcloudius/scylla:
utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source
Make the descriptions of permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries more readable and more related to what they really
do.
Mention the none-zero value requirement for the permissions_update_interval_in_ms and
the permissions_cache_max_entries when the permissions cache is enabled.
Adjust the parameters description in the scylla.yaml too.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499957053-31792-1-git-send-email-vladz@scylladb.com>
Arm the timer with a period that is not greater than either the permissions_validity_in_ms
or the permissions_update_interval_in_ms in order to ensure that we are not stuck with
the values older than permissions_validity_in_ms.
Fixes#2590
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Current algorithm was marking tables with regular columns not named
"value" as not dense, which doesn't have to be the case. It can be
either way.
It should be enough to look at clustering components. If there is a
clustering key, then table is dense if and only if all comparator
components belong to the clustering key.
If there is no clustering key, then if there are any regular columns
we're sure it's not dense.
Fixes#2587.
Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>
Cassandra 3.10 added the `duration` type [1], intended to manipulate date-time
values with offsets (for example, `now() - 2y3h`).
The full implementation of the `duration` type in Scylla requires support
for version 5 of the binary protocol, which is not yet available.
In the meantime, this patch patch adds the implementation of the underlying type
for the eventual `duration` type. Included is also the ported test suite from
the reference implementation and additional tests.
Related to #2240.
[1] https://issues.apache.org/jira/browse/CASSANDRA-11873
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <b1e481da103efee82106bf31f261c5a1f4f8d9ca.1499885803.git.jhaberku@scylladb.com>
query_options object cannot be changed after it was created. For
internal uses, like internal query paging, it is needed to create a new
object based on some of the data from an existing one with a new paging
state.
This patch adds a constructor from a unique_ptr and paging state.
using unique_ptr behave similar to move modify constructor.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Since base tables no longer look for their views, we need to parse
base tables first so that when we add a view we can fetch and connect
it to its base table.
When announcing view table mutations to other nodes we always include
the base table mutations, so there's no need to expect a view being
added before its base table.
Found out while testing view building.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170712172115.2960-1-duarte@scylladb.com>
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with different table_schema_version that the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.
To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.
This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.
Fixes #2549."
* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
tests: Add test for concurrent column addition
legacy_schema_migrator: Set digest to one compatible with the old nodes
schema_tables: Persist table_schema_version
schema_tables: Introduce system_schema.scylla_tables
schema_tables: Simplify read_table_mutations()
schema_tables: Resurrect v2 read_table_mutations()
system_keyspace: Forward-declare legacy schemas
legacy_schema_migrator: Take storage_proxy as dependency
Use name of the existing preceeding column with restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This will yield an error message that is different from that of
Cassandra, albeit still a correct one.
Fixes#2421
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
DowngradingConsistencyRetryPolicy uses live replicas count from
Unavailable exception to adjust CL for retry, but when there are pending
nodes CL is increased internally by a coordinator and that may prevent
retried query from succeeding. Adjust live replica count in case of
pending node presence so that retried query will be able to proceed.
Fixes#2535
Message-Id: <20170710085238.GY2324@scylladb.com>
"most of changes are to improve maintainability of the strategy but
the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"
* 'lcs_improvements_2' of github.com:raphaelsc/scylla:
lcs: only demote sstable from level higher than target one
lcs: improve indentation for get_overlapping_starved_sstables
lcs: improve indentation for get_compaction_candidates
lcs: partially sort candidates that will be trimmed
lcs: remove quadratic behavior from L0 compaction
lcs: introduce private interface
lcs: make some member functions static
lcs: make some functions const qualified
lcs: remove add method
lcs: extract code for higher levels compaction from get_candidates_for
lcs: simplify code to get candidates for higher levels
lcs: extract round-robin heuristic for even distribution of keys into function
lcs: update outdated comments for level 0 compaction
lcs: improve worth_promoting_L0_candidates interface
lcs: do not check if level 0 can be promoted twice
lcs: extract code for level 0 compaction from get_candidates_for
CQL reply may contain metadata that describes columns present in the
response including the information about their type.
However, Scylla incorrectly reports counter types as bigint. The
serialised format of counters and bigint is exactly the same, which
could explain why the problem hasn't been noticed earlier but it is a
bug nevertheless.
Fixes#2569.
Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>
Calculate and set digest using v2 mutations so that digests are the
same before and after migration. This is neeed so that no schema
definition exchange is required during rolling upgrade.
Fixes#2549.
When migrating schema tables from v2 to v3, mutations underlying
table schema will change, and so will their digest. However, we want
the digest to be the same on new nodes as on the old nodes, because
schema exchange is not possible between the two nodes, so they
must to request schema definitions from each other.
The solution is to make the digest persistable, so that it sticks to
given table schema, surviving both migration and node restarts. On
migration from v2, the digest will be calculated from v2 mutations, so
it will be the same on new and old nodes.
It will be used to store Scylla spcific table metadata. We cannot
store it in the standard "tables" table for compatibility reasons -
Cassandra will fail to read schema if it encounteres columns it is not
expecting.
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low to max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.
Fixes#2432.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
can_promote flag will be used to carry info about whether or not
level 0 can promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
I will split code for higher levels compaction into functions first
before putting it into its own function too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The default of 2ms is somewhat arbitrary. Now that we have a lot more
mileage deploying Scylla applications in production it does sound not
only arbitrary, but high.
In particular, it is really hard to achieve 1ms latencies in the face of
CPU-heavy workloads with it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>
Otherwise we may deadlock, as explained in commit 5e8f0efc8:
Table drop starts with creating a snapshot on all shards. All shards
must use the same snapshot timestamp which, among other things, is
part of the snapshot name. The timestamp is generated using supplied
timestamp generating function (joinpoint object). The joinpoint object
will wait for all shards to arrive and then generate and return the
timestamp.
However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>
"most of changes are to improve maintainability of the strategy but
the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"
* 'lcs_improvements' of github.com:raphaelsc/scylla: (21 commits)
lcs: only demote sstable from level higher than target one
lcs: improve indentation for get_overlapping_starved_sstables
lcs: improve indentation for get_compaction_candidates
lcs: partially sort candidates that will be trimmed
lcs: remove quadratic behavior from L0 compaction
lcs: introduce private interface
lcs: make some member functions static
lcs: make some functions const qualified
lcs: remove add method
lcs: extract code for higher levels compaction from get_candidates_for
lcs: simplify code to get candidates for higher levels
lcs: extract round-robin heuristic for even distribution of keys into function
lcs: update outdated comments for level 0 compaction
lcs: improve worth_promoting_L0_candidates interface
lcs: do not check if level 0 can be promoted twice
lcs: extract code for level 0 compaction from get_candidates_for
dist/offline_installer: add --skip-setup option to offline installer
dist/offline_installer/debian: install python-minimal package before installing scylla deps
migration_manager: Give empty response to schema pulls from incompatible nodes
migration_manager: Don't pull schema from incompatible nodes
...
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low to max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.
Fixes#2432.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
can_promote flag will be used to carry info about whether or not
level 0 can promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
I will split code for higher levels compaction into functions first
before putting it into its own function too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. It the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
"Old and new nodes will advertise different schema version because
of different format of schema tables. This will result in attempts
to sync the schema by each of the node.
Currently this will result in scary error messages in logs about
sync failing due to not being able to find schema of given version.
It's benign, but may scare users. It the future incompatibilities
could result in more subtle errors. Better to inhibit it completely."
* 'tgrabiec/fix-schema-pull-errors-during-upgrade' of github.com:cloudius-systems/seastar-dev:
migration_manager: Give empty response to schema pulls from incompatible nodes
migration_manager: Don't pull schema from incompatible nodes
service: Advertise schema tables format version through gossip
So that they are not left on disk even though we did a clean shutdown.
First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose position are cleared can be
removed.
Fixes#2550.
Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. It the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
This change adds the start of what will hopefully be a continually evolving and
improving document for helping developers and contributors to get started with
Scylla development.
The first part of the document is general advice and information that is broadly
applicable.
The second part is an opinionated example of a particular work-flow and set of
tools. This is intended to serve as a starting point and inspire contributors to
develop their own work-flow.
The section on branching is marked "TODO" for now, and will be addressed by a
subsequent change.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <470a542a92aff20d6205fb94b3fb26168735ae6f.1499319310.git.jhaberku@scylladb.com>
Instead of copying and moving the bound, pass it by reference so the
transformer can decide whether it wants to copy or not. The only
caller so far doesn't want a copy and takes the value by reference,
which would be capturing a temporary value. Caught by the
view_schema_test with gcc7.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705210255.29669-1-duarte@scylladb.com>
The state machines generated by antlr allocate many local variables per function.
In release mode, the stack space occupied by the variables is reused, but in debug
build, it is not, due to Address Sanitizer setting -fstack-reuse=none. This causes
a single function to take above 100k of stack space.
Fix by hacking the generated code to use just one variable.
Fixes#2546
Message-Id: <20170704135824.13225-1-avi@scylladb.com>
* tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev:
row_cache: Switch _stats.hits/misses to row granularity
row_cache: Rename num_entries() to partitions() for clarity
row_cache: Track mispopulations also at row level
row_cache: Track row insertions
row_cache: Track row hits and misses
row_cache: Make mispopulation counter also apply for continuity information
row_cache: Add partition_ prefix to current counters
misc_services: Switch to using reads_with[_no]_misses counters
row_cache: Add metrics for operations on underlying reader
row_cache: Add reader-related metrics
row_cache: Remove dead code
"This series introduces selective_token_range_sharder and uses it in repair to
generate dht::token_range belongs to a specific shard."
* tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev:
repair: Use selective_token_range_sharder
tests: Add test_selective_token_range_sharder
dht: Add selective_token_range_sharder
With this change, we ask all the shard to handle the ranges provided by
user and we use selective_token_range_sharder to split the ranges and
ignore the ranges do not belong to the current shard.
The test put a wrapping range into a non-wrapping range variable.
This was harmless at the time this test was written, but newer code
may not be as forgiving so better use a non-wrapping range as intended.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170704103128.29689-1-nyh@scylladb.com>
`r` is moved-from, and later captured in a different lambda. The compiler may
choose to move and perform the other capture later, resulting in a use-after-free.
Fix by copying `r` instead of moving it.
Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>
Currently, lcs will choose, for tombstone compaction, sstable with
the lowest ratio from the ones which ratio is at least above threshold
(0.2 by default).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170703185633.6644-1-raphaelsc@scylladb.com>
* 'lcs_improvements_part_2' of github.com:raphaelsc/scylla:
lcs: Match estimated tasks arithmetic to score in LCS
lcs: prevent leveled_compaction_strategy.hh from being included more than once
lcs: use vector instead for storing a level of sstables
compaction: keep only one variant of size_tiered_most_interesting_bucket
lcs: get rid of unused code in leveled_manifest
Contains fix for CASSANDRA-8904.
Added TARGET_SCORE to get rid of magic number for target score which
is now used more than once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
list is no longer needed because lcs no longer moves a sstable breaking
invariant at its level to level 0. Now lcs incrementally restores invariant
by compacting together first set of overlapping tables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
two variants of size_tiered_most_interesting_bucket existed to avoid copy,
but subsequent work will make lcs use vector for each level of sstables,
so let's only keep one variant.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.
But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.
So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparisons (sending the checksum requests to other
and waiting for them to return), but only 2 will ever be in the stage of
reading from disk and checksumming them.
The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.
Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
when do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, sstable object will be used after
freed because it was taken by ref and the container it lives in was
destroyed prematurely.
Let's fix it with a do_with, also making code nicer.
Fixes#2537.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
"compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to improve maintainability of the strategies by
putting each of them into its own header.
NOTE: No semantic change is introduced here."
* 'improve_compaction_strategy_maintainability' of github.com:raphaelsc/scylla:
compaction_strategy: move dtcs to its existing header
compaction_strategy: move lcs implementation to its own header
compaction_strategy: move stcs implementation to its own header
compaction_strategy: move compaction_strategy_impl to its own header
Configuring cpufreq service on VMs/IaaS causes an error because it doesn't supported cpufreq.
To prevent causing error, skip whole configuration when the driver not loaded.
Fixes#2051
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>
A comment states that we want the file to be old enough, but sets
a timestamp of max(), which is in the future. This may have passed
because the conversion from numeric_limits<time_t>::max() to
db_clock::time_point is not well defined (their dynamic range is
different), so truncation may have converted the large number to a
low one.
Message-Id: <20170702082903.20879-1-avi@scylladb.com>
Error messages incorrectly used the debug representation of the receiver,
rather than the text representation of the operation itself.
Fixes#113.
Message-Id: <20170701101325.3163-1-avi@scylladb.com>
Boot should not continue until a future returned by
wait_for_gossip_to_settle() is resolved. Commit 991ec4a16 mistakenly
broke that, so restore it back. Also fix calls for supervisor::notify()
to be in the right places.
Message-Id: <20170702082355.GQ14563@scylladb.com>
This reverts commit db5bf363d0. Causes
errors of the sort
Exiting on unhandled exception: exceptions::invalid_request_exception
(Keyspace 'system_traces' does not exist)
Previously, lexing and parsing errors were aggregated while CQL queries were
evaluated. Afterwards, the first collected error (if present) would be thrown as
an exception.
The problem was that when parsing and lexing errors were aggregated this way,
the parser would continue even in spite of errors like "no viable alternative".
Semantic actions attached to grammar rules would still execute, though with
variables that had not yet been initialized. This would crash Scylla.
This change modifies the error-handling strategy of CQL parsing. Rather than
aggregate errors, we throw an exception on the first error we encounter. This
ensures that grammar actions never execute unless there is a precise match.
One possible issue with this approach is that the generated C++ code from the
ANTLR grammar may not be exception-safe. I compiled Scylla in debug-mode with
ASan support and executed several erroneous CQL queries with `cqlsh`. No memory
leaks were reported.
Fixes#2466.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <db1f650a2bbb615b506d9015486eece45375a440.1498836703.git.jhaberku@scylladb.com>
compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to eventually improve maintainability of the
strategies by putting each of them into its own header.
This is the first step towards that goal.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This feature is intended to make compaction more efficient at getting rid of
droppable tombstone and expired data wasting disk space. So far, people have
been dealing with it manually through major compaction.
With strategies other than date tiered, large sstables will be left untouched
for a long time even though it's all expired. Date tiered suffers from it when
mixing data with different TTL because it only includes for compaction sstable
that is fully expired.
sstables keeps as metadata a histogram which allows us to easily estimate
droppable data ratio from gc_before. sstables which droppable data ratio is
above 20% (default value for tombstone_threshold option) will be considered
candidates for the operation.
Like in C*, we will only do tombstone removal compaction when there's nothing
to compact in standard way. It would be interesting to trigger it too when
disk usage is above a given threshold, but I decided to leave this for later.
Fixes #2306."
* 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla:
tests: more testing for tombstone compaction options
tests: basic tombstone compaction test for date tiered
compaction/dtcs: add support for tombstone compaction
tests: basic test of tombstone compaction with lcs
compaction/lcs: add support for tombstone compaction
tests: basic tombstone compaction test for size tiered
compaction/stcs: add support for tombstone compaction
tests: add test for estimation of droppable tombstone ratio
sstables: introduce function to estimate droppable tombstone ratio
compaction_manager: periodically submit cfs for compaction
streaming_histogram: fix coding style
tests: add streaming_histogram_test
streaming_histogram: implement sum
tests: add test for sstable with bad tombstone histogram
sstables: discard bad streaming histogram for future use
tests: add sstable tombstone histogram test
streaming_histogram: fix update
streaming_histogram: move it to utils
streaming_histogram: do not limit it to be used by sstables
sstables: update tombstone_histogram for cells with expiration time
Unlike other strategies, dtcs has tombstone compaction disabled by
default due to:
- deletion shouldn't be used with DTCS; rather data is deleted through TTL.
- with time series workloads, it's usually better to wait for whole sstable
to be expired rather than compacting a single sstable when it's more than
20% (default value) expired.
See CASSANDRA-9234 for more details.
For tombstone compaction, unworthy sstables are filtered out and the oldest
one is chosen because it's the one less likely to shadow data and it's also
relatively big.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
LCS will choose its candidate by starting from highest level and
getting sstable which has highest droppable tombstone ratio.
Unlike STCS which needs to choose oldest sstable from biggest tier,
LCS can choose the one with highest d__t__r because sstables in
a given level don't overlap.
Sstable picked up for tombstone removal compaction won't be demoted
or promoted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Larger sstables are hard to find sstable peers and therefore are
left uncompacted for a long time. Expired data and tombstones which
can be purged will waste disk space meanwhile.
sstable tracks droppable tombstone from which ratio can be calculated.
If ratio is greater than threshold (0.2 by default), sstable will
be eligible for compaction. Oldest sstables from biggest tiers are
preferrable because droppable data in them are more likely to satisfy
the conditions for purge, like not shadowing data in another sstable.
Subsequent patches will add support in leveled and date tiered
strategies.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Function used to estimate ratio of droppable tombstone.
A tombstone is considered droppable for cells expired before
gc_before and regular tombstones older than gc_before.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is useful for a column family which isn't generating new content
and will have lots of expired data later on that can be purged.
Compaction submission is NO-OP if there's nothing to do, so I think
it's reasonable to do it at an interval of 1 hour.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This function is used to estimate number of points in interval
[-inf,b]. It will be useful for estimating droppable tombstone
ratio in a given sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Find bad histogram which had incorrect elements merged due to use of
unordered map. The keys will be unordered. Histogram which size is
less than max allowed will be correct because no entries needed to be
merged, so we can avoid discarding those.
This is important because histogram for tombstone will be used to
estimate droppable tombstone ratio. If it's incorrectly high for many
of existing sstables, we will needlessly compact lots of them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This bug was introduced when converting java code. Return value
of map::erase() was used as if it were the value of the removed
entry, but it's actually the number of removed entries.
update() also relies on ordered keys, so map is used instead
by histogram.
In addition, histograms will be written in sorted order (like C*
does) such that we can detect bad histograms, using disk_array.
disk_array is also used from now on to read histograms.
The conversion from array to map is fine because histograms for
sstables are limited to 100 elements.
Coming patch will detect bad histograms (generated only by us)
and discard them, because we can't rely on their information.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preperation for the http stream support, creation of empty reply
should be avoided.
This patch removes a line that cannot be reached but causes the compiler
to complain.
It has no effect aside of removing the reply creation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170628130202.8132-1-amnon@scylladb.com>
If cache is missing given key, but the range is marked as continuous,
it means sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.
Another positive effect is that we don't have to clear continuity now.
Fixes#1999.
Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).
This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.
Introduced in 11b5076b3c.
Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634, and also recently
I have investigated continuous performance degradation that was also
linked to defrag on idle activity.
Until we can figure out how to reduce its impact, we should disable it.
Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
streaming histogram will later be placed in /utils, so we want
it to use std::unordered_map<> instead of disk_hash<>.
That also requires implementing serialization/deserialization
functions for it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That tombstone_histogram is used to determine droppable data ratio
for a sstable, and unlike C*, we were only updating it for
tombstones. We need to update it with expiration time of cells too,
if any. Creation time (expiration - ttl) cannot be used because if
ttl > gc_grace_seconds, the resulting sstable could be considered
worth dropping by tomstone compaction before any data is actually
expired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently our build script only supports Ubuntu 14.04/16.04 and Debian 8, this
change extends support to Ubuntu non-LTS versions / Debian non-stable versions.
Note that this is unofficial support, users should build the package for these
distributions theirselves.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498491473-28691-1-git-send-email-syuu@scylladb.com>
When peer nodes have the same partition data, i.e., with the same
checksum, we currently choose to stream from any of them randomly.
To improve streaming performance, select the peer within the same DC.
This patch is supposed to improve repair perforamnce with multiple DC.
Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>
Fixes#2528.
* tag 'asias/gossip_talk_to_more_nodes/v3' of github.com:cloudius-systems/seastar-dev:
gossip: Use vector for _live_endpoints
gossip: Talk to more live nodes in each gossip round
In large clusters with multiple DC deployment, it is observed that it
takes long delay for gossip update to disseminate in the cluster.
To speed up, talk to more live nodes in each gossip round.
Fixes#2528
This patch does the same thing to column_family::make_sstable_reader() as
commit 186f031 did to sstable::as_mutation_source().
Although usually one can fast_forward_to() on the result of a
column_family::make_sstable_reader(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.
With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists. In particular,
column_family::data_query() does a single partition read but does not
specify forwarding::no explicitly.
So this patch returns this optimization, despite this meaning that we
blatently ignore the fwd_mr flag in that case.
Fixes#2524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170626141121.30322-1-nyh@scylladb.com>
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.
The 10MB threshold for partition size in cache is lifted.
Known issues:
- There is no partial eviction yet, whole partitions are still evicted,
and partition snapshots held by active reads are not evictable at all
- Information about range continuity is not recorded if that
would require inserting a dummy entry, or if previous entry
doesn't belong to the latest snapshot
- Cache update after memtable flush happening concurrently with reads
may inhibit that reads' ability to populate cache (new issue)
- Cache update from flushed memtables has partition granularity,
so may cause latency problems with large partition
- Schema is still tracked per-partition, so after schema changes
reads may induce high latency due to whole partition needing
to be converted atomically
- Range tombstones are repeated in the stream for every range between
cache entries they cover (new issue)
- Populating scans for both small and large partitions (perf_fast_forward)
experienced a 40% reduction of throughput, CPU bound
How was this tested:
- test.py --mode release
- row_cache_stress_test -c1 -m1G
- perf_fast_forward, passes except for the test case checking range continuity population
which would require inserting a dummy entry (mentioned above)
- perf_simple_query (-c1 -m1G --duration 32):
before: 90k [ops/s] stdev: 4k [ops/s]
after: 94k [ops/s] stdev: 2k [ops/s]"
* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
tests: Introduce row_cache_stress_test
utils: Add helpers for dealing with nonwrapping_range<int>
tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
tests: simple_schema: Accept value by reference
tests: simple_schema: Make add_row() accept optional timestamp
tests: simple_schema: Make new_timestamp() public
tests: simple_schema: Introduce make_ckeys()
tests: simple_schema: Introduce get_value(const clustered_row&) helper
tests: simple_schema: Fix comment
tests: simple_schema: Add missing include
row_cache: Introduce evict()
tests: Add cache_streamed_mutation_test
tests: mutation_assertions: Allow expecting fragments
mutation_fragment: Implement equality check
tests: row_cache: Add test for population of random partitions
tests: row_cache: Add test for partition tombstone population
tests: row_cache: Test reading randomly populated partition
tests: row_cache: Add test_single_partition_update()
tests: row_cache: Add test_scan_with_partial_partitions
...
Runs readers, updates and eviction concurrently and verifies the
following property of reads:
- reads see all past writes
- reads see no partial writes within a single partition
[tgrabiec:
- extracted from a larger commit
- removed coupling with how cache_streamed_mutation is created (the
code went out of sync), used more stable make_reader(). it's simpler too.
- replaced false/true literals with is_continuous/is_dummy where appropraite
- dropped tests for cache::underlying (class is gone)
- reused streamed_mutation_assertions, it has better error messages
- fixed the tests to not create tombstones with missing timestamps
- relaxed range tombstone assertions to only check information relevant for the query range
- print cache on failure for improved debuggability
]
Those methods first create a neutral mutation_partition, and left-fold
it with the versions. The problem is that there is no neutral element
for static row continuity, the flag from the first addend always
wins. We have to copy the flag from the first version to preserve
the logical value.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- extracted from a larger commit
- fix heap comparator in apply_incomplete_target to order versions properly
- extracted partition_version detaching into
partition_entry::with_detached_versions()
- dropped unnecessary rows_iterator::_version field
- dropped unnecessary allocation of rows_entry and key copies
in rows_iterator
- dropped row_pointer
- replaced apply_reversibly() with weaker and faster apply()
- added handling of dummy entries at any position
- fixed exception safety issue in apply_to_incomplete() which may
result in data loss. We cannot move data out of applied versions
into a new synthetic row and then apply it, because if exception
happens in the middle, the data which was moved from the source
will be lost. To fix that, row_iterator::consume_row() is
introduced which allows in-place consumption of data without
construction of temporary deletable_row.
]
This streamed mutation populates cache with
the rows requested by the read. It takes whatever
it can find in the cache and fetches the remainings
from underlying source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- fixed maybe_add_to_cache_and_update_continuity() leaking entries if
the key already exists in the snapshot
- fixed a problem where population race could result in a read
missing some rows, because cache_streamed_mutation was advancing
the cursor, then deferring, and then checking continuity. We
should check continuity atomically with advancing.
- fixed rows_handle.maybe_refresh() being accessed outside of update
section in read_from_underlying() (undefined behavior)
- fixed a problem in start_reading_from_underlying() where we would
use incorrect start if lower_bound ended with a range tombstone
starting before a key.
- range tombstone trimming in add_to_buffer() could create a
tombstone which has too low start bound if last_rt.end was a
prefix and had inclusive end. invert_kind(end_kind) should be used
instead of unconditional inc_start.
- range tombstone trimming incorrectly assumed it is fine to trim
the tombstone from underlying to the previous fragment's end and
emit such tombstone. That would mean the stream can't emit any
fragments which start before previous tombstone's end. Solve with
range_tombstone_stream.
- split add_to_buffer() into overloads for clustering_row, and
range_tombstone. Better than wrapping into mutation_fragment
before the call and having add_to_buffer() rediscover the
information.
- changed maybe_add_to_cache_and_update_continuity() to not set
continuity to false for existing entries, it's not necessary
- moved range tombstone trimming to range_tombstone class
- moved range tombstone slicing code to range_tombstone_list and partition_snapshot
- can_populate::can_use_cache was unused, dropped
- dropped assumption that dummy entries are only at the end
- renamed maybe_add_to_cache_and_update_continuity() to maybe_add_to_cache()
- dropped no longer needed lower_bound class
- extracted row_handle to a seaparate patch
- made the copy-from-cache loop preemptable
- split maybe_add_next_to_buffer_and_update_continuity(bool)
- dropped cache_populator
- replaced "underlying" class with use of read_context
- replaced can_populate class with a function
- simplified lsa_manager methods to avoid moves
]
The interaction will be as follows:
- Before creating cache_streamed_mutation for given partition, cache
mutation reader sets up read_context for current partition (in one
of two ways) so that the matching underlying streamed_mutation can
be accessed at any time by cached_stream_mutation.
- cache_streamed_mutation assumes that read_context is set up for
current partition and invokes fast_forward_to() and
get_next_fragment() to access the underlying
streamed_mutation.
When reading from incomplete partition entry, we may discover we need
to read something from the underlying mutation source. In such case we
will fast forward this reader to that partition. But we must do it
using a specific snapshot, the one we obtained when entering the
partition, not the latest one.
We will need to use this information later in yet another place, when
creating a reader for incomplete cache entry. This refactors the code
so that there is a single place which determines this fact.
Currently scanning_and_populating_reader asks
just_cache_scanning_reader for the next partition from cache, together
with information if the range is continuous. If it's not, it saves the
partition it got from it and moves on to reading from the underlying
reader up to that partition. When that's done, it emits the stored
partition.
This approach won't work well with upcoming changes for storing
partial partitions. We won't have whole partitions any more, so
streamed_mutation returned for the entry needs to be prepared for
reading from the underlying mutation source. We want to reuse the same
underlying reader as much as possible, so all streamed_mutations for
given read (read_context) will share the state of the underlying
reader. Construction of a streamed_mutation will depend on the fact
that the shared state is set up for it, so we cannot have two
streamed_mutations prepared at the same time (one for entry from
primary, and one for the earlier entry being populated). This change
defers the creation of a streamed_mutation for the entry present in
cache until the whole reader reaches it to avoid this problem.
This will also have antoher potentially beneficial effect. Since we
defer the decision about which snapshot to use until we reach the
entry, there is a higher chance that the current snapshot of the entry
will match the one used last by the populating read, and that we will
be able to reuse the reader.
It's implemented by utilizing a stable partition cursor which tracks
its current position so that it's possible to revisit the cache entry
(if it's still there) after population ends. The functionality of
just_cache_scanning_reader was inlined into
scanning_and_populating_reader.
dht::token needs to be stored as a pointer now and not a reference so
that validity of old pointers is not impacted by in-place object
construction which would occur in the copy-assignment operator.
[1] says that old pointers can be used to access the new object only
if the type "does not contain any non-static data member whose type is
const-qualified or a reference type".
[1] http://en.cppreference.com/w/cpp/language/lifetime#Storage_reuse
It's needed before switching cache_entry ordering to rely solely on
cache_entry::position() so that invalidate_unwrapped() never removes
the dummy entry at the end. Currently if the range has upper bound
like this:
{ ring_position::max(), inclusive=true }
The code which selects entries for removal would include the dummy row
at the end. It uses upper_bound() to get the end iterator, and the
dummy entry has a position which is equal to the position in the
bound.
ring_position_view ranges are end-exclusive, so it's impossible to
create a partition range which would include a dummy entry.
The code is also simpler.
This will allow conversion from streamed_mutation that
supports fast forwarding to streamed_mutation that does not.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.
Consider the following case:
flush write(ck=1), write(ck=2) -> snapshot_1
cache reader 1 reads and inserts ck=1 @snapshot_1
flush write(ck=1), write(ck=2) -> snapshot_2
cache reader 2 reads and inserts ck=2 @snapshot_2
Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.
To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.
There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.
After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.
Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.
The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
This violation of the contract is currently benign, because there are
no reads from those tables before they are populated. If there were,
the cache would mark the whole (empty) range as continuous and the
table would appear empty.
It will cause similar problem once cache starts using snapshots of the
underlying mutation source. Then this lack of invalidate() will also
result in cache thinking that the table is still empty.
1) Reduce duplication by delegating to more general overloads
2) Improve documentation to not mention effects in terms of
population (detail) but rather write visibiliy
3) Rename clear() to invalidate() and merge with the range variant,
it has the same semantics
This object stores all read relevant context required all
over the place. This leads to a cleaner code.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- made read_context shareable to allow storing shared
mutable state later
- added range and cache getters
]
This is an abstraction that represents a reader
to the underlying source and auto updates itself
to make sure the reader reflects the latest state
of the underlying source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec: Add range getter to avoid friendships]
[tgrabiec:
- Extracted from a different patch
- Renamed concept names to more familiar Map and Reduce
- Renamed aggregate() to squashed() to match the existing nomenclature
- Uncommented the concepts
]
Currently mutation sources are free to return range tombstones
covering range which is larger than the query range. The cache
mutation source will soon become more eager about trimming such
tombstones. To cover up for such differences, allow telling the
restrictions to only care about differences relevant for given
clustering ranges.
This will be used by partial cache in later patches.
[tgrabiec:
- changed title,
- documented meaning of the variable,
- renamed the variable,
- introduced open_version(),
- fixed continuity of the static row not being preserved in case
a new version is created]
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
mutation source sometimes ignore fast forwarding parameter so
this change adds assertion to check that this parameter
can be safely ignored.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
By default make_reader_returning creates a reader that does not
support fast forwarding but the second parameter can be used to
make it support fast forwarding.
[tgrabiec: Improve title]
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.
Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.
[tgrabiec:
- based on the following commits:
4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
- documented that partition tombstone is always complete
- require specifying the partition tombstone when creating an incomplete entry
- replaced rows_entry(dummy_tag, ...) constructor with more general
rows_entry(position_in_partition, ...)
- documented continuity semantics on mutation_partition
- fixed _static_row_cached being lost by mutation_partition copy constructors
- fixed conversion to streamed_mutation to ignore dummy entries
- fixed mutation_partition serializer to drop dummy entries
- documented semantics of continuity on mutation_partition level
- dropped assumptions that dummy entries can be only at the last position
- changed equality to ignore continuity completely, rather than
partially (it was not ignoring dummy entries, but ignoring
continuity flag)
- added printout of continuity information in mutation_partition
- fixed handling of empty entries in apply_reversibly() with regards
to continuity; we no longer can remove empty entries before
merging, since that may affect continuity of the right-hand
mutation. Added _erased flag.
- fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
- fixed partition_builder to not ignore continuity
- renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
- standardized all APIs on is_dummy and is_continuous bool_class:es
- replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
- dropped unused remove_dummy_entry()
- simplified and inlined cache_entry::add_dummy_entry()
- fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
]
If some row entries may have to be skipped by the reader then it could
be that _clustering_rows is not empty, but read_next() will return a
disengaged optional because there are no more rows in the current
range. The code assumed that it's never the case, and if read_next()
returns a disengaged optional then we exhousted all ranges. Before
introducing dummy entries this needs to be refactored.
key() will not be valid for dummy entries, but position() is always
valid.
[tgrabiec: Extracted from other commits]
[tgrabiec: Added missing change to range_tombstone_stream::get_next]
This is a position that's always in the end after any
other position. It will be used for dummy rows_entry.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It can happen that touch() will trigger eviction on entry to
allocating section, and drop in occupancy around insertion
will not happen. As a result, we may evict a lot without detecting that.
Extend the check to include touch() and use more reliable eviction counters.
We support situations where `seastar.pc` is available, and when it isn't.
When `seastar.pc` is available, we grab the compilation flags from it in
addition to the defaults.
Some DPDK files are statically available in the source repository. We prefer
those to files placed during compilation in case modifications are made during
development (that would be lost during a build).
We always disable GCC6 concepts for the IDE, even if they are enabled while
configuring the project for compilation. CLion's parser doesn't understand them.
One final benefit to this revision is that now only target-specific flags are
modified rather than global flags.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <75457652201c5ed89d05081ec4b56b4340721cf5.1498237756.git.jhaberku@scylladb.com>
"This patch series makes two noteworthy changes:
- `gc_clock` now uses `seastar::lowres_system_clock` for better performance.
- Code common to all clocks is properly refactored and shared.
Otherwise, there are some small improvements that should have no functional impact."
* 'jhk/lowres_gc_clock/v1' of https://github.com/hakuch/scylla:
Seal clock definitions
`timestamp_clock::now()` is not `noexcept`
Make `db_clock` `time_t` conversions `constexpr`
Move common clock implementation helpers
Simplify clock implementations
db_clock.hh: Clean preprocessor directives
Make `gc_clock` a model of `Clock`
Use `lowres_system_clock` to back `gc_clock`
Add `time_t` conversions for `gc_clock`
The problem is that `std::chrono::duration_cast` is not `noexcept`. As a result,
`timestamp_clock` is actually a model of `Clock` and not `TrivialClock`.
This change fixes the dependencies between the clock implementation headers. All
the clocks share the common clock offset, but are otherwise independent (though
the `db_clock` does depend on `gc_clock` for time point conversions).
`gc_clock` reports system time, and these conversion functions allow for
manipulating time points produced by the clock without making assumptions about
its epoch.
"Fix minor issues found while building with clang trunk."
* 'clang' of https://github.com/avikivity/scylla:
seastarx: don't make seastar namespace inline
seastarx: add missing make_shared forward declaration
tests: fix call to seastar::sleep()
dht: fix bad to_sstring() call
* 'lcs_improvements_v1' of github.com:raphaelsc/scylla:
lcs: remove useless code for choosing L0 candidates
lcs: remove some dead code
lcs: make logger static
lcs: actually prefer oldest sstables of L0 when it falls behind
lcs: remove useless expensive check for overlapping L1 sstables
so now user can look at nodetool compactionstats and determine
whether or not resharding is running, for example:
$ ./bin/nodetool compactionstats
pending tasks: 3
id compaction type keyspace table completed total unit progress
<none> RESHARD system compaction_history 11 256 keys 4.30%
<none> RESHARD system compaction_history 2 256 keys 0.78%
<none> RESHARD system compaction_history 10 256 keys 3.91%
<none> RESHARD system compaction_history 8 256 keys 3.12%
<none> RESHARD system compaction_history 7 256 keys 2.73%
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>
The code filters CFs by name to not include system keyspace, but v3
schema added yet another system namespace. Better filter according to
replication strategy to accommodate for schema v4 adding even more
system keyspaces.
Fixes: #2516
Message-Id: <20170621073816.GB3944@scylladb.com>
This version optionally reads the include paths for Seastar from pkg-config and
uses file globbing to register all source and header files.
In comparison to the previous version, I see all files in the project explorer
view are "active" (rather than just .cc files). I believe there are also fewer
errors reported by the editor.
Add support Debian new stable release.
Also including following changes:
- update libthrift due to unable to compile on Debian 9
- drop dist/debian/supported_release since distribution check code moved to pbuilderrc
- add libssl-dev for build-depends
- add sudo for pbuilder extra packages (Debian doesn't have it by default install)
Signed-off-by: syuu <syuu@dokukino.com>
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498047515-19972-1-git-send-email-syuu@scylladb.com>
The memtable destructor can take a long time if the memtable is full; use
clear_gently() to clear it without impacting latency.
Fixes#2477.
Message-Id: <20170620093550.16121-1-avi@scylladb.com>
Although usually one can fast_forward_to() on the result of a
sstable::as_mutation_source(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.
With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists... So this patch returns
this optimization, despite this meaning that we blatently ignore
the fwd_mr flag in that case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170620081107.14335-1-nyh@scylladb.com>
* seastar 621b7ed...9e2b7ec (8):
> Merge "Low-resolution clocks" from Jesse
> build: disable -Wattributes when gcc -fvisibility=hidden bug strikes
> build: work around ragel 7 generated code bug
> rpc: make unmarshall_exception() inline
> future-utils: make functions global
> fix reactor stall detector rate limiting on an mostly idle system
> prometheus: Add ability to add /metrics to any http_server
> prometheus: fix memory leak in http_server_control
multi_range_mutation_reader uses fast_forward_to() to skip between
ranges, so we always need to create the underlying reader with with
mutation_reader::forwarding::yes if there is more than one range,
irrespective of whether multi_range_mutation_reader itself will be
forwarded or not.
Fixes#2510.
Introduced in commit 3018df1.
Message-Id: <1497943032-18696-1-git-send-email-tgrabiec@scylladb.com>
The code being removed could be used if parallel compaction were
allowed for LCS, but the current code isn't even allowing that.
At the moment, it's only wasting cycles.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Strategy prefers promoting oldest sstables in L0. Because sort
procedure is incorrectly sorting elements in descending order,
newest sstables will be promoted first *if and only if* L0 falls
behind (more than 32 sstables). If L0 doesn't fall behind, we'll
have all L0 sstables compacted with overlapping ones in L1.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
there's no way a L1 sstable will be in candidates set which was
previously built from list of L0 sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Enable pbuilder for Ubuntu/Debian to prevent build enviroment dependent issues.
Also support cross building by pbuilder.
(cross-building from Fedora 25 and Ubuntu 16.04 are tested)
closes#629
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497895661-26376-1-git-send-email-syuu@scylladb.com>
As Avi noticed, the "forwarding_tag" which was meant to be local in
streamed_mutation, became global. If another class copied the same trick,
it would share the same type instead of being distinct types as intended.
The problem is that in:
using forwarding = bool_class<class forwarding_tag>;
Apparently, the "class forwarding_tag" forward-declares a global type - it
does not create a local-scope type as intended, which the following apparently
does (even though no actual definition is given for that class):
class forwarding_tag;
using forwarding = bool_class<forwarding_tag>;
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619153933.13116-1-nyh@scylladb.com>
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
We mistakenly placed ./configure.py to dh_auto_build, but it's should place at
dh_auto_configure.
This bug causes issue #2505, since we haven't defined dh_auto_configure task yet(It seems running cmake on top of the dir is one of default behavior of dh_auto_configure).
Fixes#2505
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497866068-32097-1-git-send-email-syuu@scylladb.com>
This patch add the CMakeLists.txt file for IDEs based on cmake, like
CLion.
This file assumes the existence of a build/release/gen directory,
containing generated files.
Refs #867
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170618151333.94714-1-duarte@scylladb.com>
"This patch set ensures we quote the name of a UDT when it
contains characters that may cause parsing by the CQL parser
to fail.
Fixes#2491"
* 'cql3-quote-type/v1' of https://github.com/duarten/scylla:
cql3/util: Make maybe_quote() take argument by const reference
cql3/cql3_type: Quote UDT name if needed
schema: Lift maybe_quote() into cql3/util
Boost 1.55 (ubuntu 14) fails to compile because an iterator produce by
boost::adaptors::transformed() when std::ref to lambda is passed to
it do not match iterator concept. It cannot be default constructed
because std::reference_wrapper is not default constructable.
boost::range::min_element() never actually default construct it, but
concept is checked anyway. The patch fixes it by providing an explicit
functor that is default constructable.
Message-Id: <20170618131836.GD3944@scylladb.com>
When making the schema mutations for a view update, we should only
include the base table schema mutations (in case the target node
doesn't contain them) when the view is being directly updated. When it
is being updated as a side effect of updating the base table, then
including the base schema mutations will hide the actual changes being
performed on the base.
Fixes#2500
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2). It causes crashes
during range scans, reported by Gleb:
"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.
Backtrace:
at /home/gleb/work/seastar/seastar/core/apply.hh:36
rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
at ./seastar/core/future.hh:890
at /home/gleb/work/seastar/seastar/core/future-util.hh:119
at /home/gleb/work/seastar/seastar/core/future-util.hh:142
Add a performance test for deletion in addition to the existing update
and query tests. The deletion performance test is executed using the
'--delete' argument to perf_simple_query.
Fixes#2417.
Signed-off-by: Etienne Kruger <el@loadavg.io>
Message-Id: <20170615232500.26987-1-el@loadavg.io>
This patch ensures we properly quote a UDT name, which may contain
characters like ".", which can lead the name to be interpreted as a
keyspace qualified name when parsed by the CQL parser.
Fixes#2491
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The typo went unnoticed since the compiler picked up the global scope's
forwarding_tag. The bug made streamed_mutation::forwarding and
mutation_reader::forwarding the same type, but fortunately there were
no type mixups due to this.
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
"During read query with CL<ALL not all replicas are contacted. It is
possible for some replicas to cache less data for some CF's (for instance
because of node restart), so the replica choice may have a big impact
on request's completion latency and on amount of work it generates in
a cluster.
This patch series keep track of per CF cached hit ratio and uses this
information to choose best replicas for a request. Nodes with lower
hit ratios are still contacted in order to populate their cache, but
less frequently."
* 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev:
storage_proxy: load balance read requests according to cache hit rates
choose extra replica for speculation in filter_for_query()
consistency_level: drop filter_for_query_dc_local function
database: reset node's hit rate information on connection drop
messaging_service: connection drop notifier
Store cluster wide cache hit statistics in CF
messaging_service: return cache hit ratio as part of data read
Distribute cache temperature over gossiper.
periodically calculate avg cache hit rate between all shards
database: introduce cache_temperature class
Rename load_broadcaster.cc to misc_services.cc
storage_proxy: use db::count_local_endpoints function instead open code it
"This series reduces repair memory usage and improves repair speed."
* tag 'asias/fix-repair-2430-branch-master-v4.1' of github.com:cloudius-systems/seastar-dev:
repair: Repair on all shards
repair: Allow one stream plan in flight
Range tombstones were introduced in version 1.3 and there exists
no direct upgrade from 1.2 to vnext, so we can retire the code
enforcing backwards compatibility.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170614211654.82501-1-duarte@scylladb.com>
Currently commitlog_entry_writer constructor calculates serialized size
before it is knows if a schema should be included into the entry. The
result is never used since it is recalculated when schema information is
supplied. The patch removes needless calculation.
Message-Id: <20170614114607.GA21915@scylladb.com>
Currently each time UDT or tuple is parsed new object is created. If
those objects are used to create container type repeatedly it will cause
memory leak since container types are interned, but lookup in the
cache is done using pointer to a contained type (which will be always
different for UDT and tuples). This patches interns also UDT and tuple,
so each type the same object is parsed same pointer is also returned.
Refs #2469Fixes#2487
Message-Id: <20170612142942.GO21915@scylladb.com>
Currently, shard zero is the coordinator of the repair. All the work of
checksuming of the local node and sending of the repair checksum rpc
verb is done on shard zero only. This causes other shards being
underutilized.
With this patch, we split the ranges need to be repaired into at least
smp::count ranges, so sizeof(ranges) / smp::count will be assigned to
each shard. For exmaple, we have 8 shards and 256 ragnes, each shard
will repair 32 ranges. Each shard will repair the 32 ranges
sequencially. There will be at most 8 (smp::count) ranges of repair in
parallel.
In "repair: Use more stream_plan" (commit 2043ffc064), we
switched to do stream while doing checksum instead of do stream only
after checksum pahse is completed. We take a parallelism_semaphore
before we do checksum, if there are more than sub_ranges_to_stream
(1024) ranges, we start a stream_plan and wait for the streaming to
complete (still under the parallelism_semaphore). So at most
parallelism_semaphore (100) stream_plans can be in parallel.
The parallelism_semaphore limits the parallelism of both checksum and the
streaming plan. However, it is not necessary to have the same
parallelism for both checksum and streaming, because 1) a streaming
operation itself runs in parallel (handling ranges on all shards in
prallel, sending mutaitons in parallel) , 2) and with more streaming plan
(in worse case 100) means we can write to 100 memtables at the same time
and flush 100 memtables to disk at the same time which can take a lot of
memory.
With this patch, we only allow one stream plan in flight.
If we do two truncates in a row, the second will have neither memtable
nor sstable data. Thus we will not write/remove sstables, and thus
get no resulting truncation replay position.
Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>
This patch makes storage proxy to choose replicas to read from base on
their cache hit rates. Replicas with higher cache hit rates will see
more requests while replicas with lower hit rates will see less. Local
node has a special bonus and will get more requests even if another node
has slightly higher cache hit rate (same goes for local vs remote DC),
but after the patch it is no longer guarantied that a coordinator node
will be chosen as a replica for the read (if the feature is enabled).
Currently storage proxy has to loop over remaining replicas to search
for suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
Merge filter_for_query_dc_local() functionality into filter_for_query().
This is more efficient since filter_for_query_dc_local() partitions
endpoints into 'local' and 'remote' set but filter_for_query() already
does it for CL=LOCAL so for such queries we needlessly do it twice.
Node may go down, so after it restarts cache hit rate info will be
incorrect and it can be overwhelmed with traffic until new and
up-to-date cache hit rate arrives. Solve this by dropping node's
information on connection reset, it is more accurate than relying on
gossip which may be slow and miss reboot of a node.
When a node start it does not have any information about cache temperature
of other nodes in the cluster and it is hard (if not impossible) to make
right guess. During cluster startup all nodes have cold caches, so there
is no point to redirect reads to other nodes even though local cache it
cold, but if only that node restarted than other nodes have populated
cache and reads should be redirected.
The node will get up-to-date information about other nodes caches,
but only after receiving first reply, until then it does not have the
information to make right decisions which may cause unwanted spikes
immediately after restart. Having cache temperature in gossiper helps
to solve the problem.
This patch adds new class cache_hitrate_calculator whose responsibility
is to periodically calculate average cache hit rates between all shards
for each CF.
end_bound() returns temporary object (end_bound_ref), so it cannot be
taken by reference here and used later. Copy instead.
Message-Id: <20170612132328.GJ21915@scylladb.com>
Apparently some GDB versions (7.11.1-86.fc24) don't parse double '>'
in a type name, so this:
std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family>>
should be this:
std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family> >
Message-Id: <1497256644-4335-1-git-send-email-tgrabiec@scylladb.com>
The problem is that 'key' is a 'bytes' object now, which doesn't have __format__.
Fixes the following error:
Traceback (most recent call last):
File "~/src/scylla/scylla-gdb.py", line 184, in invoke
TypeError: non-empty format string passed to object.__format__
Error occurred in Python command: non-empty format string passed to object.__format__
Message-Id: <1497253433-374-2-git-send-email-tgrabiec@scylladb.com>
"This series switches repair to use more stream plans to stream the mismatched
sub ranges and use a range generator to produce sub ranges.
Test shows no huge memory is used for repair with large data set.
In addition, we now have a progress reporter in the log how many ranges are processed.
Jun 06 14:18:22 [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Jun 06 14:19:55 [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Fixes #2430."
* tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev:
repair: Remove unused sub_ranges_max
repair: Reduce parallelism in repair_ranges
repair: Tweak the log a bit
repair: Use more stream_plan
repair: iterator over subranges instead of list
From seastar-dev.git calle/concorde
Normally, we require that all mutations applied to a column family
have replay positions higher than all previously flushed.
The main reason for this is to be able to determine when to drop a
commit log segment, i.e. determine that all replay positions less
than X are now in sstables.
This patch series, small as it is, relaxes this by instead of just
keeping track of high rp applied, keep a reference count to each
segment per CF in memtables, and on flush, release this very count.
The only case where we need to keep a water mark for RP is then
for table truncation, for which we simply say that the highest RP
applied to the column family is the lowest allowed henceforth,
and use the old reordering logic for this instead. I.e. very rare.
There is of course one (big?) downside to all this, and this is
"normal" commit log replay on startup after crash/shutdown.
Since we relax RP ordering, we cannot use RP:s in sstables as
low marks for replay start, since it is now allowed to exist
non-persisted mutations in commitlog with lower RP:s than
previously flushed. I.e. we more or less always have to replay
the full commit log.
It is worth noting though that due to compaction and the non-
propagation of RP marks to new sstables, we end up often
doing this anyway, so it is hard to say how much of a regression
this is.
truncate
With commitlog keeping use-count per CF id, we can ease the ordering
restriction on replay positiontion. Previously we required that all
added mutations have a position > previously flushed. However, if
we accept that replay must now be all data, by keeping track instead
per CF of highest RP ever entered, we can instead just set a
low mark on truncation, since this is the only remaining hard
RP divider.
Use per CF-id reference count instead, and use handles as result of
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction.
Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
Test should
a.) Wait for the flush semaphore
b.) Only compare segement sets between start and end, not start,
end and inbetwen. I.e. the test sort of assumed we started
with < 2 (or so) segments. Not always the case (timing)
Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
We currently repair all the ranges in parallel.
1) All the ranges will contend for parallelism_semaphore, instead of
processing multiple ranges in parallel and calculating the sub ranges
(which take memory) for each range in parallel, we can handle the ranges
one bye one.
We could have enough parallelism because the checksum are calucated on
all the shards.
2) If for some reason the repair failed, if we handle ranges 1 by 1, we
can log which range of repair is successful. Next time, we can ignore
them. If we start ranges in parallel, it has a high chance, no single
range is completed because all the ranges are on going.
Refs #1912
- Count n out m ranges the repair is running for (kind of progress report)
- Make the 'Found differing range' log debug because it can be millions
of such entries
- Print the failed ranges
In the very beginning, we use a stream_plan for each checksum range.
Later, we changed to use a single stream_plan for all the checksum
ranges. It pushes memory presure to streaming, e.g., millinons of ranges
in a vector to send over RPC.
To fix, we do checksum and streaming in parallel, limit the number of
checksum ranges stored in memory.
Fixes#2430
When starting repair, we divided the large token ranges (vnodes) linto small
subranges of a desired length (around 100 partition), and built a huge list
of those subranges - to iterate over them later and compare checksums of
those chunks.
However, building this list up-front is completely unnecessary, and wastes
a lot of memory: In a test with 1 TB of data, as much as 3 gigabytes was
spent on this list. Instead, what we do in this patch is to find the next
chunk in a DFS-like splitting algorithm, using only the token range
midpoint() function (as before). The amount of memory needed for this is
O(logN), instead of O(N) in the previous implementation.
Refs #2430.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
After change in boot, read_filter is called by distributed loader,
so its update to _filter_file_size is lost. The load variant
which receives foreign components that must do it. We were also
not updating it for newly created sstables.
Fixes#2449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170606151129.5477-1-raphaelsc@scylladb.com>
At least on Debian8, mk-build-deps -i silently finishes with return code 0
even it fails to install dependencies.
To prevent this, we should manually install the metapackage generated by
mk-build-deps using gdebi.
Fixes#2445
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-2-git-send-email-syuu@scylladb.com>
Installing openjdk-8-jre-headless from jessie-backports breaks texlive on
jessie main repo.
It causes 'Unmet build dependencies' error when building gdb package.
To prevent this, force insatlling texlive from jessie-backports before start
building gdb.
Fixes#2444
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-1-git-send-email-syuu@scylladb.com>
"This series fixes some issues with the thrift_server, namely
ensuring that streams and sockets are properly closed.
Fixes#499Fixes#2437"
* 'thrift-server-fixes/v1' of github.com:duarten/scylla:
thrift/server: Close connections when stopping server
thrift/server: Move connection class to header
thrift/server: Shutdown connection
thrift/server: Close output_stream when connection is done
This patch adds the shutdown() function to thrif_server::connection,
and calls it after a connection is done.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The column mask identifies the kind of atom in a row in an sstable. Two
definitions of these values were present: one as a C-style enumeration and one
as a C++11-style enumeration.
The C++11-style definition is used elsewhere in `sstables.cc`. It also offers
additional type-safety.
Therefore, this commit removes the inlined C-style enumeration.
Fixes#2214.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <c525b4ae7fad3b54480e133921aa4ffe0dd5d9ce.1496352711.git.jhaberku@scylladb.com>
To reduce unwanted dependencies, we need to replace dependency from collectd to
collectd-core.
However, collectd provides /etc/collectd/collectd.conf, so without this package
we need to install the configuration file by our self.
So install the file on .postinst script.
Fixes#2426
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496231743-7828-1-git-send-email-syuu@scylladb.com>
In the write path we don't wait for view updates, as they happen in
the background.
The view schema tests can fail when running with more than one cpu due
to this inherent race condition: the write to the base table returns
while the view updates are still being processed, after which we issue
a query to the view table. The shard handling the view data is not
guaranteed to finish processing the mutation before handling the query.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170531165726.9212-1-duarte@scylladb.com>
invariant is broken if size of L0 candidates is equal to max
sstable size because the overlapping L1 sstables will not be
added to compacting set, and they will be promoted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170530143708.3775-1-raphaelsc@scylladb.com>
* seastar 68dbf60...b1f69cc (10):
> metrics: fix namespace in documentation
> add special logger for memory allocation failures
> xen: remove
> Merge "sanity checks, fixes and extensions in the perftune.py" from Vlad
> tutorial: more "seastar" namespace
> execution_stages: fix build errors in comments
> tutorial: more "seastar" namespace additions
> tutorial: more minor changes
> tutorial: minor changes to the introduction
> tutorial: start overhauling the examples to use "seastar" namespace
Mimic origin behaviour, iff TLS encryption is enabled, and
native_transport_port_ssl is set and different from
native_transport_port, start both tls- and non-tls
listeners.
Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>
Removing non-murmur3 partitioners will allow us to reduce memory footprint
and speed up some code by utilizing the properties of the murmur3 partitioner
token.
Message-Id: <20170528172536.16079-1-avi@scylladb.com>
"
- Introduce a parent span IP and span ID paradigm.
- Introduce time series tables to simplify traces processing.
- Add the "How to get traces?" chapter to the tracing.md.
"
* 'tracing-span-ids-and-time-series-helpers-v4' of github.com:cloudius-systems/seastar-dev:
docs: tracing.md: add a "how to get traces" chapter
tracing::trace_keyspace_helper: introduce a time series helper tables
tracing: cleanup: use nullptr instead of trace_state_ptr()
tracing: introduce a span ID and parent span ID
Currently, start and end size of compaction are calculated using the
uncompressed size of data component. bytes_on_disk() returns size
used by all components.
NOTE: start and end size are written to compaction history, so users
who monitor it should be aware of this change.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525212129.6758-1-raphaelsc@scylladb.com>
sstable::data_size() is used by rebuild_statistics() which only
returns uncompressed data size, and the function called by it
expects actual disk space used by all components.
Boot uses add_sstable() which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
r__s() too.
Fixes#1592
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
We can make the dependency more abstract by using mutation_source
instead of an sstable.
Will be useful in some stress tests which want to avoid the disk, but
is also good for the sake of decoupling.
Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>
Information about interrupts is invaluable when debugging performance
problems with Scylla in the field. node_exporter doesn't include that in
the list of collectors enabled by default, so we suggest we do it here.
The list that goes into this file is the default list as shown by
node_exporter, with "interrupts" added to it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170525153008.26720-1-glauber@scylladb.com>
"lcs' current behavior will make it hard to reduce number of
levels by increasing sstable size because it uses uncompressed
length when deciding whether or not a level needs promotion.
Demotion process is slower because of that."
* 'lcs_promotion_improvement_2' of github.com:raphaelsc/scylla:
lcs: use sstable compressed length when computing level size
sstables: introduce sstable::ondisk_data_size
sstables: unconditionally set sstable data file size
partial sstable files aren't being removed after each failed attempt
to flush memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, column family may be left with a huge amount of partial
files which will overwhelm subsequent boot when removing temporary
TOC. In the past, it led to OOM because removal of temporary TOC
took place in parallel.
Fixes#2407.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
lcs uses uncompressed length of sstables when computing size of a
level, and that may result in unnecessary promotion when the goal
is to reduce the number of levels after an increase in sstable
size, for example.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this new function is an alternative to data_size(), which will
return size of data component. data_size() returns uncompressed
size of data component if compression is enabled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Streaming ususally takes long time to complete. Abort it on false
positive idle detection can be very wasteful.
Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. The real idle session will be aborted eventually if other
mechanisms, e.g., streaming manager has gossip callback for on_remove
and on_restart event to abort, do not abort the session.
Fixes#2197
Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
Use the boost::intrusive containers in order to achieve a O(1) complexity
for both "LRU list" update and to minimize the memory overhead in the hash
table item to "LRU list" item connection:
- Make the timestamped_val be both a bi::list and a bi::unordered_set
item.
- Make a bi::unordered_set be a cache backend instead of the
std::unordered_map.
As a result dropping k LRU items becomes an O(k) operation instead of
O(N log N), where N is a total number of all cached items:
- Every time a value is read - move it to the front of the "LRU list"
(O(1)).
- When we need to remove k LRU items:
- Repeat k times:
- Take an element from the back of the "LRU list". (O(1)).
- Remove it from the bi::unordered_set and dispose. (O(1)).
We use an auto-unlink configuration for bi::list, therefore
disposing an item is going to auto unlink it from the list.
* 'permissions_cache_move_to_intrusive-v1' of github.com:scylladb/seastar-dev:
utils::loading_cache: cleanup
utils/loading_cache.hh: use intrusive list to store the lru entry
utils::loading_cache: implement automatic rehashing
utils::loading_cache: make the underlying map to be an intrusive unordered_set
"Enforces commutativity of addition:
m1 + m2 == m2 + m1
and consistency of difference and addition with equality:
m1 + (m2 - m1) == m1 + m2"
* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
mutation: Make compare_*_for_merge() consistent with equals()
tests: mutation: Improve assertion failure message
tests: Use default equality in test_mutation_diff_with_random_generator
mutation: Make counter cell difference consistent with apply
tests: range_tombstone_list_test: Improve error message
tests: range_tombstone_list: Check adjacent range merging
range_tombstone_list: Merge adjacent range tombstones in apply()
tests: mutation: Check commutativity of mutation addition
range_tombstone_list: Avoid violating set invariant
range_tombstone_list: Make tombstone merging commutative
range_tombstone_list: Add erase() operation to the reverter
range_tombstone_list: Make all undo operations ordered relative to each other
utils: Extract to_boost_visitor() to a separate header
allocating_strategy: Introduce alloc_strategy_unique_ptr<>
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
The case when both cells are dead was not handled properly, the diff
was always empty, whereas the cell with higher timestamp should win.
Caused test_mutation_diff_with_random_generator to fail.
The code was inserting an entry with the same key as its successor,
and only later adjusting the key of the old entry. This is violating
set's invariant of unique keys, and insertion may cause rebalancing. I
don't know if this violation actually causes problems currently, but
it's safer not to.
Fix by first updating the existing entry and then inserting the new
one.
Example of non-commutative case:
a = [1, 5]@t2
b = {[2, 3]@t1, [4, 5]@t1}
a + b = [1, 5]@t2
b + a = [1, 4)@t2, [4, 5]@t2
After this patch, both will yield [1, 5]@t2.
The patch also changes the logic to handle overlaps of tombstones with
equal timestamps to be handled symmetrically. They are now merged
instead of split on either of the boundary.
Refs #2158.
Later operation may depend on the result of previous operation. Same dependency
is present when reverting the operations.
Fixes assertion failure in update reverter.
Fix the shrink() O(n log n) complexity issue by constantly pushing the corresponding intrusive
list entry to the head of the list every time the values are read.
This will keep the list ordered by the last read time from the most recently read
to the least recently read entry.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
- Start the cache with 256 buckets - the minimum number of buckets.
- Limit the maximal number of buckets by 1M buckets.
- Keep the load factor between 0.25 and 1.0 as long as the number of buckets is
between the minimum and the maximum values mentioned above.
- Grow and shrink the hash every "refresh" period if needed.
- Enable bi::power_2_buckets and bi::compare_hash bi::unordered_set options.
- Enable bi::unordered_set_base_hook's bi::store_hash option.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Make the underlying map to be a boost::intrusive::unordered_set<timestamped_val>
instead of std::unordered_set<Key, timestamped_val>.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Metadata is read using default priority class, which can significantly
slow down the process under high load. Compaction class can be used,
and if it turns out to be a problem, we can switch to a special class
for it.
Fixes#1859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>
"fixes a problem in which memory requirement for loading in-memory
components of sstables is very high due to unlimited parallelism."
* 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla:
database: fix indentation of distributed_loader::open_sstable
database: reduce memory requirement to load sstables
sstables: loads components for a sstable in parallel
sstables: enable read ahead for read of in-memory components
sstables: make random_access_reader work with read ahead
SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple() which uses 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.
Due to the fact that sstable loading parallelism is unlimited, Scylla
may require much more memory to load all sstables, and that may lead to
OOM. Higher the number of sstables higher the memory overhead.
To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each of which metadata is ~300kb, memory
usage peak was ~18G. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d
To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Read ahead 4 is used. Let's adjust it later if needed. File size is
used to prevent file_input_stream from issuing useless reads beyond
file size with read ahead enabled. We can switch to variant without
length once file_input_stream handles it properly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Scylla crashes if read ahead is enabled by file_random_access_reader
because a call to seek() destroys the existing input stream without
closing it first.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The bootstrapping node will be a gossip only member, until the streaming
finishes and the node becomes NORMAL state. If during this time, the
bootstrapping node is overwhelmed with streaming, it is possible the node will
delay the update the gossip heartbeat. Be forgiving for the bootstrapping node
and do not remove it from gossip too fast. Otherwise, streaming rpc verbs will
not be resent becasue the node is not in gossip membership anymore.
Fixes#2150
Message-Id: <286d7035d854f2a48abf4e1e2e3bfcb8b22b9ca2.1494553580.git.asias@scylladb.com>
We are currently suspecting that the bloom filter false positive ratio
is not being respected. While trying to debug that, I found out that we
have a more basic problem:
The numbers are all meaningless, because the stats are wrong. We are
accumulating by summing the ratios together. It's easy to see how this
doesn't work, if we look at an example where the ratio for some CFs is
zero:
SST1: false = 1, total = 2. ratio = 0.5
SST2: false = 0, total = 98 . ratio = 0.
The real ratio in this example is 1 / (98 + 2) = 1 %, but the displayed
ratio will be 0.5 + 0 = 0.5.
This patch will map reduce all the sstables together keeping both
numerator and denominator, yielding the right value at the end. To do
that, we'll reuse the existing ratio_holder class, which already does
exactly what we want.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170518222333.16307-1-glauber@scylladb.com>
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
* seastar 45b718b...f726938 (2):
> memory: add --mbind option to supress warning message when running Seastar apps on container
> Add support for Gentoo Linux irqbalance configuration detection.
"This series fixes bugs related to materialized views, most pertaining
to column filtering in the where clause."
* 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla:
tests/view_schema_test: Add more test cases
tests/cql_assertions: Add assertion for row set equality
single_column_relation: Correctly print IN relation
statement_restrictions: Allow filtering regular columns for views
statement_restrictions: Relax clustering restrictions for views
statement_restrictions: Relax partition restrictions for views
cql3/statements: Prevent setting default ttl on view
cql3/restrictions: Complete implementation of is_satisfied_by()
db/view: Re-implement clustering_prefix_matches()
db/view: Re-implement partition_key_matches()
db/view: Generate regular tombstone for base deletions
db/view: Consider cell liveness when generating updates
db/view: Don't generate view updates for static rows
"There are numerous issues in the current implementation of permissions
cache starting from the logical errors and bugs and ending with the
suboptimal implementation described in the issue #2262."
* 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev:
utils::loading_cache: avoid the reads storm when the key is not in the cache
utils::loading_cache: cleanup
utils::loading_cache: align the constrains in the constructor with the parameters description
utils::loading_cache: refresh in the background
auth::auth: add operator<<() for a permission_cache key
auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
db::config: define a saner default value for permissions_validity_in_ms
mutation_partition has a slicing constructor which is supposed to copy
only the rows from the query range. The rows are located using
nonwrapping_range::lower_bound() and
nonwrapping_range::lower_bound(). Those two have two different
implementations chosen with SFINAE. One is using std::lower_bound(),
and one is using container's built in lower_bound() should it
exist. We're using intrusive tree in mutation_partition, so
container's lower_bound() is preferred. It's O(log N) whereas
std::lower_bound() is O(N), because tree's iterator is not random
access.
However, the current rule for picking container's lower_bound() never
triggers, because lower_bound() has two overloads in the container:
./range.hh:618:14: error: decltype cannot resolve address of overloaded function
typename = decltype(&std::remove_reference<Range>::type::upper_bound)>
^~~~~~~~
As a result, the overload which uses std::lower_bound() is used.
Spotted when running perf_fast_forward with wide partition limit in
cache lifted off. It's so slow that I timeouted waiting for the result
(> 16 min).
Fixes#2395.
Message-Id: <1495048614-9913-1-git-send-email-tgrabiec@scylladb.com>
"These patches add support to setup and operate ScyllaDB on Gentoo Linux.
* scylla_setup and related scripts
* node_health_check
I have kept them as simple as possible and tested them to setup and operate
succesfully a three nodes cluster running on Gentoo Linux."
* 'gentoo_linux_support' of github.com:ultrabug/scylla:
scylla_setup: add gentoo linux installation detection
prometheus node_exporter install: add support for gentoo linux
raid setup: add support for gentoo linux
ntp setup: add support for gentoo linux
kernel check: add support for gentoo linux
cpuscaling setup: add support for gentoo linux
coredump setup: add support for gentoo linux
detect gentoo linux on selinux setup
add gentoo_variant detection and SYSCONFIG setup
According to description of permissions_validity_in_ms the permissions_cache is enabled if this
value is set to a non-zero value. Otherwise the permissions_cache is disabled.
According to the permissions_update_interval_in_ms description it must have a non-zero value if permissions_cache
is enabled.
permissions_cache_max_entries description doesn't explicitly state it but it makes no sense to allow it to be zero
if permissions_cache is enabled.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patch changes the way a loading_cache works.
Before this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
1) If a value was loaded more than _expiry time ago it is removed from the cache.
2) If the cache is too big - the less recently loaded values are removed till the cache
fits the requested size.
After this patch:
1) If a permissions key is not in the cache it's loaded in the foreground and the original
query is blocked till the permissions are loaded.
2) Every _period the timer does the following:
1) If a value in the cache was loaded or read for the last time more than _expiry time ago - it's removed from the cache.
2) If the cache is too big - the less recently read values are removed till the cache fits the requested size.
3) The values that were loaded more than _refresh time ago are re-read in the background.
The new implementation allows to minimize the amount of the foreground reads for a frequently used value to a single
event (when the value is loaded for the first time).
It also ensures we do not reload values we no longer need.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Our configuration already has the default values for for permission cache parameters.
Therefore if user decides to give some bad parameters we'd rather fail the load and inform him/her
about the bad parameters instead of trying to silently "fix" them.
In addition the original code wasn't passing the parameters correctly: it switched the "expiry" and "refresh" parameters in
the utils::loaded_cache constructor.
Add to this that the original code was doing really strange things in the permission_cache::expiry(cfg) method.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
It makes little sense to have the same value for permissions_update_interval_in_ms and permissions_validity_in_ms.
This may cause the values to be invalidated only because some minor delays in the timer scheduling.
It makes a lot more sense to make the permissions_update_interval_in_ms value smaller than permissions_validity_in_ms.
This way we would minimize the chances of "false invalidation" due to some small delays in the timer scheduling.
In addition, 2s seems to be a too small value for permissions_validity_in_ms since our default read_request_timeout_in_ms is 5s.
This means that a single system_auth read failure would guarantee that the following queries are going to read system_auth data
in the foreground.
Setting it to 10s would allow a second read attempt before we enforce the foreground read.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"Notably:
- add validation of the results (e.g. fragment count, expectations about disk activity)
- add cache-specific tests"
* 'tgrabiec/add-cache-tests-to-perf-fast-forward' of github.com:cloudius-systems/seastar-dev:
tests: perf_fast_forward: Report cache stats
row_cache: Keep counters in a struct
tests: perf_fast_forward: Add cache-specific tests
tests: perf_fast_forward: Extract test_reading_all()
tests: perf_fast_forward: Add validation of the results
tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments
tests: perf_fast_forward: Allow the test to be interrupted
tests: perf_fast_forward: Allow testing with cache enabled
row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
Not 100% proper, but in line with how we still store the info.
Ensures (helps at least) to keep schema loaded from tables
and schema from builder comparable.
Fixes schema_changes_test error.
Message-Id: <1495030581-2138-2-git-send-email-calle@scylladb.com>
"When mutation reader enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request."
* 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test forwarding in single-key readers
sstables: Remove unused code
sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
"This series adds private repository support to scylla-housekeeping"
* 'amnon/housekeeping_private_repo_v3' of github.com:cloudius-systems/seastar-dev:
scylla-housekeeping service: Support private repositories
scylla-housekeeping-upstart: Use repository id, when checking for version
scylla-housekeeping: support private repositories
make_pkeys() needs to be invoked with n equal to the number of keys
which the table was populated with. Otherwise the extra keys, which
are missing in the table, may be placed anywhere in the vector due to
ring order sorting, and break the assumption that the table contains
all keys from the array up to index n. This resulted in the test
reading slighlty less fragments than it would follow from the desired
count.
Another problem is that we should not skip the fast_forward_to() call
for the inital range (workaround for a bug in sstable mutation
reader), otherwise we will read slightly less than expected as well.
From http://github.com/avikivity/scylla exponential-sharder/v3.
The sharder, which takes a range of tokens and splits it among shards, is
slow with large shard count and the default
murmur3_partitioner_ignore_msb_bits.
This patchset fixes excessive iteration in sstable sharding metadata writer and
nonsignular range scans.
Without this patchset, sealing a memtable takes > 60 ms on a 48-shard
system. With the patchset, it drops below the latency tracker threshold I
used (5 ms).
Nonsingular queries used exponential expansion of the token space to
avoid spending too much cpu time on near-empty tables, but the generation
of the search space was itself exponential. Switch to the exponential sharder
which has linear cost.
The "exponentiality" is not carried over from one range to another, because
we expect one or two ranges (two ranges result from a wrapped around thrift
token range).
Like the regular sharder, the exponential sharder divides a range into
subranges owned by individual ranges. Unlike the regular sharder, it
generates ever-increasing subranges, spanning more and more shards, and
eventually returns several subranges per shard. To avoid using
exponential cpu and memory, subranges belonging to a single shard are merged,
and a flag is set to indicate the subranges are not ordered wrt. each other.
Right now, next_token_for_shard() only allows iterating linearly in shard
order. Add the ability to select a specific shard to skip to (in case we're
only interested in a single shard), and to select larger ranges (so that
exponential increases are not implemented by iteration).
The partitioners now depend on smp::count to be initialized correctly,
but smp::count isn't available at static initialization time.
The scylla executable isn't affected because it calls set_global_partitioner()
after smp::count has been initialized.
Fix by deferring initialization to the first global_partitioner() call.
"This patch ensures we read the base table rows that an update
is modifying, in order to correctly calculate the set of
materialized view updates.
The read-before-write is performed on the shard applying the
update and attempts to do a precise read of the rows being modified,
which can be more than one in case of ranged deletions or a
batch update."
* 'materialized-views/read-existing/v2' of https://github.com/duarten/scylla:
database: Read existing base mutations
db/view: Calculate clustering ranges for MV read-before-write query
db/view: Replace entry if cells don't match
view_info: Store base regular col in the view's PK as column_id
compound_view_wrapper: Add tri_compare
bound_view: Build range bound from bound_view
clustering_bounds_comparator: Enable Range concept
range: Add lvalue version of transform()
tests: Add test case for nonwrapping_range::intersection()
nonwrapping_range: Add intersection() function
When generating updates for a materialized view we need to read the
existing base row, to be able to determine the primary key of the view
row the new base update will supplant, in case the view includes a
base non-primary key column in its own primary key. That old view row
will be tombstoned or updated, if it exists, depending on the difference
between the new base row and the existing one, if any.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
For row set equality, the order of the actual rows doesn't need to
match the order of the expected rows.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Introduce the calculate_affected_clustering_ranges() function to
calculate the smallest subject of affected clustering ranges that we
need to query for.
The update_requires_read_before_write() function checks whether
a view is potentially affected by the base update.
The patch also cleans up the may_be_affected_by() function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
So that the output of a set of relations can be fed back into the CQL
parser; useful for materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If a base table regular columns is part of the view's pk, and if that
column changes, we should replace the entry, by deleting the row(s)
with the old value and inserting a new one.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In process_clustering_columns_restrictions(), don't require all
clustering columns to be restricted if we're dealing with a
materialized view's where clause.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In process_partition_key_restrictions(), don't require all partition
key columns to be restricted if we're dealing with a materialized
view's where clause.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the is_satisfied_by() function for the remaining
types of restrictions, lifting the function declaration to
abstract_restrictions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements clustering_prefix_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a subset of the clustering columns.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements partition_key_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a component of a compound partition key.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we take into account the liveness of the base's
regular column in the view's pk when generating view updates.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch stores the base_non_pk_column_in_view column as column_id,
which is more convenient, and it also stores a two-level optional to
encode both lazy initialization and the absence of such a column.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We introduce the bound_view::to_range_bound() function, which builds a
wrapping_range or nonwrapping_range bound from a bound_view.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
intersection() returns an optional range with the intersection of the
this range and the other, specified range.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Defines origin v3-format for system/schema tables, and use them for
schema storage/retrival.
Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail if encountering such a feature used.
Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.
Tested against dtests. Note that patches for dtest + ccm
will follow."
* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
legacy_schema_migrator: Actually truncate legacy schema tables on finish
database: Extract "remove" from "drop_columnfamily"
v3 schema test fixes
thrift: Update CQL mapping of static CFs
schema_tables: Use v3 schema tables and formats
type_parser: Origin expects empty string -> bytes_type
cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
types_test: v3 style schemas enforce explicit "frozen" in tupes/ut:s
cql3_type: v3 to_string
cql_types: Introduce cql3_type::empty and associate with empty data_type
schema: rename column accessors to be in line with origin
schema: Add "is_static_compact_table"
schema_builder: Add helper to generate unique column names akin origin
schema: Add utility functions for static columns
schema: Use heterogeneous comparator for columns bounds
cql3_type_parser: Resolve from cql3 names/expressions
cql3_type: Add "prepare_interal" and "references_user_type"
cql3::cql3_type: Add prepare_internal path using only "local" holders
cql3_type: Add virtual destructor.
database/main: encapsulate system CF dir touching
...
* seastar 4a3118c...45b718b (7):
> tests: make connect_test use a random port
> log: Introduce log.info0
> configure.py: link to DPDK PMD drivers which are already built on build/dpdk and enabled by default on DPDK config
> Update fmt submodule
> perftune: fix perftune.py IndexError when NIC uses less IRQs than requested.
> build: Add more required build dependencies to the Dockerfile
> Prometheus: Reserve in protobuf object before iterating
"Tested with:
- test.py --mode relase
- debug/test-serialization
- c-s with both debug and relase compiled scylla with authentication enabled:
cassandra-stress write n=10000 no-warmup -rate threads=10 -mode native unprepared cql3 user='cassandra' password='cassandra'
Tested with:
- test.py --mode relase
- debug/test-serialization
- c-s with both debug and relase compiled scylla with authentication enabled:
cassandra-stress write n=10000 no-warmup -rate threads=10 -mode native unprepared cql3 user='cassandra' password='cassandra'"
* 'compress_tracing_session_id-v6' of github.com:cloudius-systems/seastar-dev:
cql_server::response: rework the tracing session ID insertion
utils::UUID: align the UUID serialization API with the similar API of other classes in the project
utils: serialization: unify the variety of serialize_XXX(...)
cql_server::response: rework the compress(...) method
cql_server::response: store the frame flags inside the class
"This series fixes a set of regressions introduced by
f7bc88734a, resulting in two failed
tests:
testDenseNonCompositeTable(org.apache.cassandra.cql3.validation.operations.CreateTest)
and
testStaticColumnsWithDistinct(org.apache.cassandra.cql3.validation.entities.StaticColumnsTest)"
* 'cql-fixes/v1' of github.com:duarten/scylla:
update_statement: Reject empty values for dense clustering key
modification_statement: Fix detection of clustering keys
cql3/restrictions/statement_restrictions: Consider statement type
cql3/statements/modification_statement: Extract statement_type
Insert the tracing session ID into the response body in the cql_server::response constructor.
Fixes#2356
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The standard serialization API (e.g. in data_value) includes the following methods:
size_t serialized_size() const;
void serialize(bytes::iterator& it) const;
bytes serialize() const;
Align the utils::UUID API with the pattern above.
The only addition is that we are going to make an output iterator parameter of a second method above
a template so that we may serialize into different output sources.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Use the same templated implementation for all different serialize_XXX(...).
The chosen implementation is based on the std::copy_n(char*, size, OutputIterator),
which is heavily optimized and will be using memcpy/memmove where possible.
This patch also removes the not needed specializations that accept signed integer
values since we were casting them to unsigned value anyway.
The std::ostream based specifications are also removed since they are not used
anywhere except for a test-serialization.cc and adjusting the ostream to the iterator
is a single-liner.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Cleanup the compress(...) method interface:
- Encapsulate the technical details inside the method:
- Re-write the _body inside the method instead of returning it.
- Set the response::_flags inside the method.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
It makes a lot more sense to keep the flags mask inside the response and update it each time
the corresponding feature is set instead of holding the separate components like tracing state
pointer.
This patch adds this ability to set the flags.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The code that removes each sstable runs in a thread. Parallel
removing of a lot of sstables may start a lot of threads each of which
is taking 128k for its stack. There is no much benefit in running
deletion in parallel anyway, so fix it by deleting sstables sequentially.
Fixes#2384
Message-Id: <20170516103018.GQ3874@scylladb.com>
When mutation reads enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request.
We would be in that state if consume_row_start() returns porceed::yes
and the stream ends after that. This can happen if slicing using
promoted index determined that there are no cells in the partition in
the range.
Newer setuptools parse_version() don't like dashed version strings,
so we should trim it to avoid false negative version_compare() checks.
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170511162646.22129-1-ultrabug@gentoo.org>
Prevent the accidental dropping of system_auth and system_traces objects (keyspaces and tables)
but allow their modification (including tables).
We need to be able to modify keyspases in order to set/modify the replication strategy and its parameters.
We need to be able to ALTER the tables in order to allow rolling upgrades when some of the tables has changed.
Fixes#2346Fixes#2338
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1494363335-20424-1-git-send-email-vladz@scylladb.com>
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.
Fixes#2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>
Now that update_statement uses statement_restrictions, we need our
validation logic to take the statement type into account, in
particular to deal with insertion statements which only set static
columns but specify clustering values.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the statement_type into its own file. The type
will be later passed to statement_restrictions for validation
purposes.
Further along, we could add methods to it that currently live in other
statements so we can move more validation into statement_restrictions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch updates the mapping of static CFs so that their CQL
representation is a non-compound, non-dense schema with static
columns, instead of regular ones. This matches the representation os
static CFs in Cassandra 3.x.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
More pointedly: Expose columns as is (currently
all_columns_in_select_order), expose name->column mapping more
appropriately named.
Renaming like this is not strictly neccesary, but there is a point to
trying to keep nomenclature similar-ish with origin, esp. when select
order column need to become filtered (spoiler alert).
Cassandra 3 uses cql names for column/field types, thus
we need to parse these out-of-line, and resolve more akin
to the cql parser.
Also wrap building user types similarly to origin, using
a "builder" wrapper, and usage graph resolving.
Regular single key query will never reconcile with CL=ONE since there
will be no digest mismatch, but range queries do not have digest stage,
so always goes through reconcile code. For CL=ONE there will be only one
result though, so no need to run complicated reconciliation logic and the
only result can be returned directly.
Message-Id: <20170509100334.GQ28272@scylladb.com>
The "_specs" array contains column specifications that have the bind
marker name if there is one. That results in
get_partition_key_bind_indices() not being able to look up a column
definition for such columns. Fix the issue by keeping track of the
actual column specifications passed to add() like Cassandra does.
Fixes#2369
Message-Id: <1494397358-24795-1-git-send-email-penberg@scylladb.com>
Having a varadic parameter being used in implicit sprint is not
very readable + makes it less intuitive when suddenly system keyspace
becomes more than one -> multiple sprints in the chain -> more confusion
or more execution paths.
Its not that horrible with some spread out sprint:s
Reduces processing overhead and network traffic.
We can't use the NO_METADATA flag in the metadata object, because this
is a request attribute; different executions of the same prepared statement
can have different settings for skip_metadata.
Message-Id: <20170419175145.19766-1-avi@scylladb.com>
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."
* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
schema: Kill index_type enum
schema: Kill index_info class
cql3/statements/create_index_statement: Use database::existing_index_names() in validation
cql3/statements: Use secondary index manager in alter_table_statement class
index: Add secondary_index_manager
thrift/handler: Use index_metadata
db/schema_tables: Index persistence
schema: Add all_indices() to schema class
schema: Remove add_default_index_names() from schema_builder class
db/schema_tables: Add system table for indices
cql3/Cgl.g: DROP INDEX
cql3/statements: Add drop_index_statement class
database: Add find_indexed_table() to database class
cql3: Return change event from announce_migration()
cql3/statements: Multiple index targets for CREATE INDEX
cql3/statements: Use index_metadata in create_index_statement class
cql3/statements: Use feature flag in create_index_statement class
service/storage_service: Add feature flag for secondary indices
database: Add get_available_index_name() to database class
schema: Add get_default_index_name() to index_metadata class
...
Fix the CQL front-end to populate the partition key bind index array in
result message prepared metadata, which is needed for CQL binary
protocol v4 to function correctly.
Fixes#2355.
Message-Id: <1494247871-3148-1-git-send-email-penberg@scylladb.com>
"This series introduces partial support for range deletions. This allows
deletion operations such as
delete from cf where p=1 and c > 0 and c <= 3.
This series only adds support for single-column range restrictions.
We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom of top bound.
We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.
This limitation should stay in place until we can properly represent range
tombstones in the storage format."
* 'range-deletions/v2' of https://github.com/duarten/scylla:
mutation: Set cell using clustering_key_prefix
mutation_partition: Harmonize apply_delete overloads
prefix_compound_view_wrapper: Add is_full and is_empty functions
tests/cql_query_test: Add range deletion tests
cql3: Partially support ranged deletions
single_column_primary_key_restrictions: Implement has_bound()
modification_statement: Use statement_restrictions for where clause
statement_restrictions: Expose primary key restrictions
to_string: Add missing include
Current script has a bug that overwrites symlink on
/etc/sysconfig/selinux by real file, the script not able to disable
SELinux because of it.
So keep symlink after modifying the file.
Fixes#2279
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493663263-10573-1-git-send-email-syuu@scylladb.com>
This is a folded version of the following rework on the node health
check script:
- Added support for non-default cql + nodetool ports
- Script will not exit if either Scylla-server / Scylla-jmx / Both
services are not up and running. It will alert the user about it and
which output cannot be collected, but continue collecting everything
else.
- Removed lshw installation and non-needed use in commands
- Script supports RHEL/CentOS/Ubuntu14/Ubuntu16/Debian (tested on all
beside Debian, should behave the same as Ubuntu14/16)
- All Indentation issues fixed -> using only tab (no spaces)
consistently.
- >> vs. > was fixed as well in the needed places.
- Changes the ${VAR_NAME} instances to $VAR_NAME, and kept the {} only
where needed.
- Check Scylla service as Vlad recommended using 'ps -C'
- Fixed the CQL not listening error message.
- Added Sanity check if script is attempted to run on non-Fedora and
non-Debian OS -> alert the user and exit.
- Removed the MANUAL CHECK LIST section (moved to Google Forms)
- Added date in head of the report.
- Removed text from Report's "PURPOSE" section, which was referring to
the "MANUAL CHECK LIST" (not needed anymore).
[ penberg: Fold into a single commit and add proper license. ]
Acked-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1493900076-29170-1-git-send-email-penberg@scylladb.com>
"gcc 7 doesn't like some of our code, so adjust to make it happy."
* 'gcc7' of http://github.com/avikivity/scylla:
Remove exception specifications
commitlog: handle noexcept conflict between unlink and function object
thrift: change generated code namespace
org::apache::cassandra (the generated namespace name) gets confused with
apache::cassandra (the thrift runtime library namespace), either due to
changes in gcc 7 or in thrift 0.10. Either way, the problem is fixed
by changing the generated namespace to plain cassandra.
"Fixes abort which happens when making a single-key query for a key which
is after all keys present in the sstable."
* 'tgrabiec/fix-abort-in-index-reader' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Add test cases for single-key out of range reads
sstables: index_reader: Remove redundant function
sstables: index_reader: Fix abort in advance_and_check_if_present()
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures the different mutation_partition::apply_delete()
overloads behave similarly, so that, for example, an empty clustering
key is treated the same way as an empty
exploded_clustering_key_prefix.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces partial support for range deletions. This allows
deletion operations such as
delete from cf where p=1 and c > 0 and c <= 3.
We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom of top bound.
We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.
This limitation should stay in place until we can properly represent range
tombstones in the storage format.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch replaces the custom where clause processing by adding and
using a statement_restrictions field to modification_statement.
This improves code reuse and also moves some checks to prepare-time.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This changes announce_migration() to return a change event directory in
schema_altering_statement base class. It's needed for drop index
statement, which does not know the keyspace or column family until it
looks up them based on the index. Two stage approach of announcing a
migration and then creating the change event won't work because in the
latter stage, the lookup will fail. The same change in
announce_migration() has been applied to Apache Cassandra.
This adds a find_index_nomame() helper to the schema class, which
searches for index that is otherwise equal but ignores the name of the
index in comparison. This is needed to for CREATE INDEX to reject
duplicate index creation.
After compaction revamp, compaction type set by cleanup at its ctor
is being overwritten at compaction::setup(). Consequently, cleanup
would not be stopped by 'nodetool stop cleanup' and cleanup would
be listed as regular compaction in 'nodetool compactionstats'.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>
We estimate number of partitions for a given range of a column familiy
and split the range into sub ranges contains fewer partitions as a
checksum unit.
The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.
Fixes#2299
Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
"This series fixes the user after free issue in gossip and elimates the
duplicated / unnecessary mark alive operations.
Fixes#2341"
* tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev:
gossip: Ignore callbacks and mark alive operation in shadow round
gossip: Ingore the duplicated mark alive operation
gossip: Fix user after free in mark_alive
In shadow round, we only interested in the peer's endpoint_state, e.g., gossip
features, host_id, tokens. No need to call the on_restart or on_join callbacks
or to go through the mark alive procedure with EchoMessage gossip message. We
will do them during normal gossip runs anyway.
After sending echo message, the Node might not be in the
endpoint_state_map anymore, use the reference of local_state
might cause user after free.
Fixes#2341
These will as of-late always be ready. Removing the future usage
in preparation for changing api signature to void(*)() (i.e. prevent
breakage on seastar update)
While blob_storage is marked as an unaligned type, the back references also
point to an unaligned type (a pointer to blob_storage), since a back
reference can live in a blob_storage. This triggers errors from zapcc/clang 4.
Fix by creating a type for the reference, which is marked as unaligned.
Message-Id: <20170502071404.507-1-avi@scylladb.com>
after 'compaction: make major compaction go through compaction manager',
the test fails because task is preempted in debug mode before it reaches
intruction to increase stat.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170501183255.6191-1-raphaelsc@scylladb.com>
During the changes in the way the housekeeping check for newer version
and warn about it in the installation the UUID part was removed but kept
in the sarounding if.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170426075724.7132-1-amnon@scylladb.com>
Some fields that belong to regular and cleanup aren't needed for
resharding_compaction, such as incremental selector (which is used
for determining max purgeable timestamp for a given decorated key)
Better move those fields to regular and make cleanup inherit from
regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428195611.9196-1-raphaelsc@scylladb.com>
Currently, fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.
Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
by calculation of max purgeable timestamp and prevents expired data from
being purged. The problem is that this sequence of events can keep happening
forever as reported by issue #2260.
NOTE: This problem was easier to reproduce before improvement on compaction
of expired cells, because fully expired sstable was being converted into a
sstable full of tombstones, which is also considered fully expired.
Fixes#2260.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
"The logic responsible for converting counter updates to counter shards was
not covered by unit tests and didn't transform counter cells inside static
rows.
This series fixes the problem and makes sure that the tests cover both
static rows and transformation logic."
* tag 'pdziepak/static-counter-updates/v1' of github.com:cloudius-systems/seastar-dev:
tests/counter: test transform_counter_updates_to_shards
tests/counter: test static columns
counters: transform static rows from updates to shards
"Fixes #2326."
* 'tgrabiec/fix-range-tombstones-missing-when-slicing' of github.com:cloudius-systems/seastar-dev:
tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones()
tests: mutation_source_test: Add test for slicing of clustered rows
tests: mutation_reader_assertions: Log expectations
tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation()
tests: sstables: Use read_row() for single-key reads
tests: sstables: Test more configutaions of sstable writer in test_sstable_conforms_to_mutation_source()
sstables: Improve logging
sstables: index_reader: Fix advance_to() to include relevant range tombstones
Attempting to create huge zones may introduce significant latency. This
patch introduces the maximum allowed zone size so that the time spent
trying to allocate and initialising zone is bounded.
Fixes#2335.
Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>
So that as_mutation_reader() will create the same kind of reader which
database::make_sstable_reader() does.
Before this change, all readers were range readers.
Test different versions of the format, and different promoted index
block sizes. The size of 1 is especially important, it will put each
fragment in a separate block, exposing various issues with promoted
index handling.
We set the scheduler wakeup granularity to 500usec, because that is the
difference in runtime we want to see from a waking task before it
preempts the running task (which will usually be Scylla). Scheduling
other processes less often is usually good for Scylla, but in this case,
one of the "other processes" is also a Scylla thread, the one we have
been using for marking ticks after we have abandoned signals.
However, there is an artifact from the Linux scheduler that causes those
preemption to be missed if the wakeup granularity is exactly twice as
small as the sched_latency. Our sched_latency is set to 1ms, which
represents the maximum time period in which we will run all runnable
tasks.
We want to keep the sched_latency at 1ms, so we will reduce the wakeup
granularity so to something slightly lower than 500usec, to make sure
that such artifact won't affect the scheduler calculations. 499.99usec
will do - according to my tests, but we will reduce it to a round
number.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170427135039.8350-1-glauber@scylladb.com>
Fix issues found by PVS-Studio as reported by Phillip Khandeliants.
Merge branch 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev
* 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev:
type_parser: catch exceptions by reference and not by value
token_metadata::get_host_id(ep): add a missing 'throw'
Found by PVS-Studio static analyzer:
Type slicing. An exception should be caught by reference rather than by value.
Fixes#2288
Reported-by: Phillip Khandeliants
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Caught by PVS-Studio static analyzer:
The object was created but it is not being used. The 'throw' keyword could be missing: throw runtime_error(FOO);
Reported-by: Phillip Khandeliants
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
A column family which was truncated will remove itself from compaction
manager. Any task running a compaction should be interrupted and a
task waiting to run should bail out when it wakes up.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224350.15965-3-raphaelsc@scylladb.com>
Introduce two time series helper tables that will simplify the querying
of traces.
One for querying regular traces:
CREATE TABLE system_traces.sessions_time_idx (
minute timestamp,
started_at timestamp,
session_id uuid,
PRIMARY KEY (minute, started_at, session_id))
and one for querying slow query records:
CREATE TABLE system_traces.node_slow_log_time_idx (
minute timestamp,
started_at timestamp,
session_id uuid,
start_time timeuuid,
node_ip inet,
shard int,
PRIMARY KEY (minute, started_at, session_id))
With these tables one may get the relevant traces like in an example below:
SELECT * from system_traces.sessions_time_idx where minutes in ('2016-09-07 16:56:00-0700') and started_at > '2016-09-07 16:56:30-0700'
Fixes#2243
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patch makes the tracing framework follow the general idea of Google's
Dapper paper: traces generated in a context of the same query are forming
a single-rooted acyclic tree where in a ScyllaDB case vertexes are spans running
on each involved replica Node and edges are RPCs sent from one Node to another.
- Each vertex in the tree above has an ID - "span ID".
- In order to be able to build the tree from the sessions traces we need
to know the parent "span ID" - the ID of a span that sent an RPC that created
the current span.
- Each span of a tracing session is given a 64-bit random span ID.
- The root span has a span_id::illegal_id value.
This patch adds:
- The described above parent span ID and a span ID to the one_session_records
object.
- The current span ID is passed in the trace_info struct to the remote replica.
- Add parent_id and span_id columns to system_traces.events table for the parent
ID and span ID.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
After 4742008b70, _read_partial_row is
never set, and we will fail here in case the consumer will exhoust the
range. That would be the case if the end bound of the slice aligns
with the end of the index page.
Fix by assuming that if we're out of range in the middle of partition,
we sliced.
Message-Id: <1493121249-18847-1-git-send-email-tgrabiec@scylladb.com>
"This series introduces the row_tombstone class, which represents a
tombstone applied to a clustering row. It distinguishes itself from a
normal tombstone by the fact that it contains a regular tombstone and
a shadowable one, which can be erased by a row marker.
The intent of the series is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be, leading to incorrect results."
* 'materialized-views/shadowable/v5' of https://github.com/duarten/scylla:
sstables: Read and write shadowable tombstones
mutation_partion: Use row_tombstone
mutation_partion: Introduce row_tombstone
mutation_partition: Introduce shadowable tombstones
idl-compiler: Support optional fields in views
tombstone: Extract out relational operators
row_marker: Mark constructors explicit
This patch replaces the current row tombstone representation by a
row_tombstone.
The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.
We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:
1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3
These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.
This patch only addresses the memory representation of such
row_tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the row_tombstone class, which represents a
tombstone made up of a regular tombstone and a shadowable one.
The rules for row_tombstones are as follows:
- The shadowable tombstone is always >= than the regular one;
- The regular tombstone works as expected;
- The shadowable tombstone doesn't erase or compact away the regular
row tombstone, nor dead cells;
- The shadowable tombstone can erase live cells, but only provided
they can be recovered (e.g., by including all cells in a MV update,
both updated cells and pre-existing ones);
- The shadowable tombstone can be erased or compacted away by a newer
row marker.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A shadowable tombstone is a tombstone that can be replaced by a
smaller one if provided a row_marker with a bigger timestamp than the
shadowable tombstone.
In the context of a row, it is only valid as long as no newer insert
is done (thus setting a live row marker; note that if the row
timestamp set is lower than the tombstone's, then the tombstone
remains in effect as usual). If a row has a shadowable tombstone with
timestamp Ti and that row is updated with a timestamp Tj, such that
Tj > Ti (and that update sets the row marker), then the shadowable
tombstone is shadowed by that update. A concrete consequence is that
if the update has cells with timestamp lower than Ti, then those cells
are preserved (since the deletion is removed), and this is contrary to
a regular, non-shadowable row tombstone where the tombstone is
preserved and such cells are removed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts out the relational operators in struct tombstone
to a class capable of generating them from a tri-compare function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Some gcc versions incorrectly complain:
tests/log_histogram_test.cc:87:22: error: ‘opts1’ is not a valid template argument for type ‘const log_histogram_options&’ because object ‘opts1’ has not external linkage
size_t hist_key<node<opts1>>(const node<opts1>& n) { return n.v; }
Apparently this is a bug in gcc:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52036Fixes#2307.
Message-Id: <1493108791-11247-1-git-send-email-tgrabiec@scylladb.com>
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler. A single issue remains, but it's
probably in std::experimental::optional::swap(); not in our code."
* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
sstable_test: avoid passing negative non-type template arguments to unsigned parameters
UUID: add more comparison operators
sstable_datafile_test: avoid string_view user-defined literal conversion operator
mutation_source_test: avoid template function without template keyword
cql_query_test: define static variable
cql_query_test: add braces for single-item collection initializers
storage_service: don't use typeid(temporary)
logalloc: remove unused max_occupancy_for_compaction
storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
storage_proxy: drop unused member access from return value
storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
read_repair_decision: fix operator<<(std::ostream&, ...)
Fixes the following error in "scylla segment-descs" and a similar one in "scylla lsa-segment":
Traceback (most recent call last):
File "scylla-gdb.py", line 530, in invoke
gdb.write('0x%x: lsa free=%d region=0x%x zone=0x%x\n' % (addr, desc['_free_space'], desc['_region'], desc['_zone']))
TypeError: %x format: an integer is required, not gdb.Value
Message-Id: <1493029465-6482-1-git-send-email-tgrabiec@scylladb.com>
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it. This includes its size (for freeing) and
an 8-byte migrator (for migrating). Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).
This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used). It uses the following techniques:
- ULEB128-like encoding (actually more like ULEB64) so a live object's header
can typically be stored using 1 byte
- indirection, so that migrators can be encoded in a small index pointing
to a migrator table, rather than using an 8-byte pointer; this exploits
the fact that only a small number of types are stored in LSA
- moving the responsibility for determining an object's size to its
migrator, rather than storing it in the header; this exploits the fact
that the migrator stores type information, and object size is in fact
information about the type
The patch improves the results of memory_footprint_test as following:
Before:
- in cache: 976
- in memtable: 947
After:
mutation footprint:
- in cache: 880
- in memtable: 858
A reduction of about 10%. Further reductions are possible by reducing the
alignment of lsa objects.
logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.
Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
We both move names_ to its destination, and call names_.size() in the same
expression; this has undefined evaluation order, and fails with clang.
With this patch as well as the clang build fixes, Scylla starts and is
able to serve requests (light cassandra-stress load).
Message-Id: <20170423121727.1948-1-avi@scylladb.com>
This patch fixes a failure of virtual_reader_test, where both the test
itself and the cql_test_env initialize the messaging_service to listen
on the same address and port, triggering an assert in
posix_ap_server_socket_impl::accept().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170423104240.21275-1-duarte@scylladb.com>
Clang warns that the expression will be evaluated (doh). While the warning
seems dubious, keep it and change the code to call the function outside
typeid(), in case it does help someone one day.
Clang's std::abs() doesn't support __int128_t, so use __int64_t instead. With
this change, it's possible that a read repair 252,700 years after a write
will be interpreted as a recent write and the read repair will incorrectly
be skipped; hopefully by that time __int128_t will be standardized.
Argument-dependent lookup requires that the operator be declared in the
same namespace as the class; move it there.
While at it, de-static it, it only causes bloat.
"Currently, a shared sstable is rewritten at all shards it belongs to, and only
after that, it's deleted.
This new algorithm adds the ability to reshard a set of sstables together at a
single shard and produce unshared sstable for all shards involved.
That's important for the leveled compaction strategy issue, in which the number
of sstables growing considerably after resharding. What happened is that every
sstable was being split into N ones, so we could end up with tons of small
sstables. Now, we will reshard together a set of adjacent sstables."
* 'sstable_resharding_revamp_v9' of github.com:raphaelsc/scylla:
tests: add test for new sstable resharding
database: kill column_family::start_rewrite
database: wire up new resharding algorithm
database: implement new sstable resharding algorithm
database: introduce function to replace new sstables by their ancestors
prevent regular compaction from choosing shared sstables
compaction_strategy: implement resharding strategy for compaction strategies
sstables: store more info in foreign_sstable_open_info
sstables: make it possible to get open info from loaded sstable
database: export column family dir
database: inform if column family has shared tables
sstables: add method to export ancestors
lcs: implement get_level_count
compaction_manager: introduce method to check if manager stopped
lcs: restore invariant instead of sending overlapping sst to L0
sstables: extend compaction for new resharding
sstables: allow shard A to correctly create sstable for shard B
compaction: rework compacting_sstable_writer to work with multiple writers
compaction: prepare compacting_sstable_writer to work with writers
sstables: rework compaction to make it easy to extend
NOTE: it's not wired yet.
Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.
Another benefit is that we'll no longer have resharding resulting
in number of sstables growing considerably after resharding.
A full-sized leveled sstable is usually 160MB, so after resharding,
we could have N files of 160MB/N. Now, leveled strategy will help
resharding. N adjacent sstables of same level will be resharded
together, so we'll end up with N files of N*160MB/N.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When resharding, we're working with sstables from all shards. So let's say
we're done with resharding of sstable A that belongs to shard 0 and 1 and
sstable B that belongs to shard 1 and 2. SStables were generated for
shards 0, 1, and 2. So shards 0, 1, and 2 need to load the new sstables
and remove the ancestors. Shard 1 for example will remove sstables A and
B (ancestors) and add the new one. Then it comes this new function.
We'll forward new sstables to their target shards using foreign sstable
open info.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That's doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Strategies other than leveled will reshard one shared sstable at
a time, and the target shard, shard at which job will run, for each
job will be chosen in a round-robin fashion.
For leveled strategy, we will reshard together smp::count adjacent
sstables that belong to same level.
The reason for that is because resharding one sstable at a time
may result in creation of file for each shard, meaning after
resharding we could end up with NO_SSTABLES*NO_SHARDS.
These resharding strategies will be used for our new resharding
algorithm.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We need that info for opening a sstable at different shard, unlike
sstable loader which has everything in entry_descriptor, obtained
from components in sstable filename.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It will be useful for resharding which will need to move a
sstable across shards, and to do that without reloading the
sstable at target shard, we need to be able to get the open
info and move it to the target shard instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's gonna be useful to quickly determine if it's worth resharding
a column family.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
A large token span sstable may find its way into high level due to resharding,
which means the strategy invariant is broken. The invariant is restored by
compacting first set of overlapping sstables, meaning that the restoration
is done incrementally for multiple overlapping sets.
Invariant is restored by regular compaction after resharding puts new unshared
sstables into their original level, where level > 0.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Extends compaction for new resharding algorithm. Not wired yet.
New resharding will compact shared sstable(s) and create one
sstable for each owner. It's up to the caller to open these
new unshared sstables at their respective column families.
This new approach will save a lot of bandwidth because we'll
no longer read the entire shared sstable #smp::count times.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's possible by shard A explicitly saying that sstable is created
for shard B. If we don't do that, sharding metadata isn't correct,
and consequently sstable will report wrong owners.
We'll need this for resharding which will create sstables for all
shards that own the shared sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compacting_sstable_writer only allowed one writer so far, but we will need
multiple ones for resharding.
It's done by moving writer management to compaction.
finish_sstable_writer() is added for compaction impl to stop all writers,
whereas stop_sstable_writer() will only stop current writer (needed when
current sstable reaches max limit size for example).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
No need for compacting_sstable_writer to store items that are available
in compaction class. Also, that's a step towards supporting multiple
writers for compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compact_sstables() supported both regular and cleanup compaction,
but with lots of conditions that made it ugly and hard to extend.
In the future, we want to introduce a new type of compaction for
resharding that will create one sstable for every shard owning
the sstable(s) given as input. That will be easier now.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* seastar 2eec212...194d80f (4):
> removing the collectd tests
> fix fstream metrics reporting.
> do_for_each: Make it check for need preempt
> core/sharded: introduce copy method to foreign_ptr
"Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
Statistics for reclamation pauses for a read workload over
larger-than-memory data set:
Before:
avg = 865.796362
stdev = 10253.498038
min = 93.891000
max = 264078.000000
sum = 574022.988000
samples = 663
After:
avg = 513.685650
stdev = 275.270157
min = 212.286000
max = 1089.670000
sum = 340573.586000
samples = 663
Refs #1634."
* tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev:
lsa: Reduce reclamation latency
tests: Add test for log_histogram
log_histogram: Allow non-power-of-two minimum values
lsa: Use regular compaction threshold in on-idle compaction
tests: row_cache_test: Induce update failure more reliably
lsa: Add getter for region's eviction function
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
Statistics for reclamation pauses for a read workload over
larger-than-memory data set:
Before:
avg = 865.796362
stdev = 10253.498038
min = 93.891000
max = 264078.000000
sum = 574022.988000
samples = 663
After:
avg = 513.685650
stdev = 275.270157
min = 212.286000
max = 1089.670000
sum = 340573.586000
samples = 663
Refs #1634.
Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
We will want to reuse the min_size mechanism for the whole compaction
threshold, including the occupancy threshold. That threshold is close
to the segment size and we cannot pick a power of two which would be
close enough to what we need.
Therefore, change log_histogram to support arbitrary minimum base.
bucket_of() was moved into log_histogram_options so that it can be used
in number_of_buckets(), which makes for a simple and much less
error-prone implementation.
Idle-time compaction should not produce not-compactible segments
becuase that means we would have to evict a lot when we finally need
to reclaim some memory, so that occupancy falls below the regular
compaction threshold. This may cause latency spikes.
Refs #1634.
After changing region evicitability condition to be less strict, cache
update stopped failing because reclamation was able to compact dense
region. Induce failure by installing evictor which refuses to evict
from cache beyond few elements.
"This series makes several optimizations to sstable mutation reader relevant
for large partitions.
Some highlights:
One optimization is to use the index for skipping across clustering restrictions.
Currently we read whole partition in such cases. That includes the case when
we need to read a static row and then jump to some clustering row in the
middle of the partition. Another case is having more than one clustering
restriction, e.g. selecting multiple single rows from the same partition.
Another optimization is using information from the index for creation of
streamed_mutation. That can save us the cost of reading the partition header
form the data file in case we would not continue reading, but skip to the
middle of that partition. Or we may not even attempt to read anything from
that partition, if after we determine the key that reader will be put behind
other readers, which will exhaust the query limit first.
Another optimization is switching single-partition queries to use the
index_reader infrastructure. Index lookups via index_reader are faster than
find_disk_ranges(). This is also a cleanup, a step towards converting all code
to use the index_reader."
* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits)
sstables: Remove unused code
sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
sstables: mutation_reader: Use index_reader for single-partition reads
sstables: mutation_reader: Add trace-level logging
sstables: mutation_reader: Move partition reading code to sstable_data_source
sstables: mutation_reader: Move definitions out of the class body
sstables: Move binary_search() to a header
database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating
sstables: index_reader: Introduce advance_to_next_partition()
sstables: index_reader: Introduce advance_and_check_if_present()
sstables: index_reader: Introduce advance_past()
sstables: index_reader: Make copyable
sstables: index_reader: Optimize advancing to extreme positions
sstables: index_reader: Keep two last pages alive
dht: ring_position_view: Add key getter
dht: ring_position_view: Add constructor and factory from ring_position_view
sstables: mutation_reader: Advance to next partition using index in some cases
sstables: index_reader: Expose access to partition key and tombstone
sstables: index_reader: Introduce promoted_index_view
sstables: mutation_reader: Move _index_in_current to sstable_data_source
...
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().
perf_fast_forward, rows: 1000000, value size: 100
Before:
Testing forwarding with clustering restriction in a large partition:
pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu
no 0.002182 2 916 3 152 2 0 0 1 1 88.1%
After:
Testing forwarding with clustering restriction in a large partition:
pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu
no 0.000758 2 2639 3 152 2 0 0 1 1 48.6%
This is also a cleanup, a step towards converting all code to use the
index_reader.
There are instantiations of binary_search() used in sstables.cc, but
defined in partition.cc. The instantiations are explicitly declared in
partition.cc, but the types changed and they became obsolete. The
thing worked because partition.cc also instantiated it with the right
type. But after that code will be removed, it no longer would, and we
would get a linker error. To avoid such problems, define
binary_search() in a header.
The idea behind caching is that when we have two index readers where
one is catching up with the other, each page will be read only
once. Currently that's not always the case. There is a case when
advance_to() may need to read two pages. That's when the target
position is not found in the first page as determined by the summary
index. The second reader which catches up would have to read the first
page as well, but it would not be in cache any more. To avoid this
extra I/O let's keep a reference to the two last pages touched by the
index.
To produce a streamed_mutation for the next partition, we need to read
its key and the tombstone. Currently we always do that by consuming
the partition header from the data file. In some cases that may cause
unnecessary IO.
It's better to obtain partition information from the index if we
already have it. We can save on IO if the user will skip past the
front of partition immediately after.
It is also better to pay the cost of reading the index if we know that
we will need to use the index anyway soon. This patch predicts that by
checking if there are any clustering restrictions. If there are any,
we will almost surely need_skip() and use the index anyway.
This change also lays the ground for unification of multi and single
partiton queries without loss of performance.
sstable_data_source holds a shared state between mutation_reader and
streamed_mutation for sstables. The information whether index is in
current partition will have to be accessed by both in the following
patches.
Before the change, the following scenario was happening:
1) we try to skip based on clustering restrictions
2) we find the page and fast forward to it, recording walker's
lower bound counter
3) we read the first fragment, it's not a tombstone, so we reset the walker,
and its lower bound counter too
4) the fragment is not in range (the range starts in the middle of the page)
5) needs_skip() is true, we redo the index lookup, which wastes some CPU
This change fixes the problem by avoiding resetting the walker. We can
do that because leading tombstones are checked with a non-mutable
contains_tombstone()
Allows detecting changes of lower_bound().
Result of advance_to() is not enough. When we get false from
advance_to() twice in a row, lower bound may or may not have changed.
Simplifies the code a bit, but also will make it easier to calculate
the next position we should skip to after forwarding, taking into
consideration both the position forwarded to as well as clustering
ranges of the query. That will be just calling
_ck_ranges_walker->lower_bound() after it is trimmed.
In general mp_row_consumer has better information about the next
position to read. It could be after the position we forward to if
there are clustering restrictions. This will be exploited in the
following patches.
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor and major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.
Fixes#1156.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>
Problem is that column family field of task wasn't being set for resharding,
so column family wasn't being properly removed from compaction manager.
In addition to fixing this issue, we'll also interrupt ongoing compactions
when dropping a column family, exactly like we do with shutdown.
Fixes#2291.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170418125807.7712-1-raphaelsc@scylladb.com>
streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.
Fixes#2287.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
"This series fixes some errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler. A few more fixes are needed to produce
a binary."
* tag 'clang/1/v1' of https://github.com/avikivity/scylla:
logalloc: avoid auto in function argument declaration
thrift: avoid auto in function argument declaration
streamed_mutation: fix non-POD argument to C-style variadic function
mutation_partition_serializer: avoid auto in function argument declaration
date: use correct casts for years
streaming: avoid auto in function argument declaration
repair: avoid auto in function argument declaration
gms: expose gms::inet_address streaming operator
murmur3_partitioner: fix build on clang
i_partitioner: remove unused function
byte_ordered_partitioner: fix bad operator precedence
result_set: pass comparator by reference to std::sort()
to_string: move standard container overloads of to_string to std:: namespace
cql_type: fix bad enum syntax on clang
build: disable more warnings for clang
build: fix detection of unsupported warnings on clang
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides. Replace with the right type.
Since the right type is private, add some friendship.
Clang warns that passing a non-POD to a C-style variadic function will
result in an abort(). That happens to be exactly what we want, but to
silence the warning, use a template instead. Since templates aren't
allowed in local classes, move the containing class to namespace scope.
The standard says, and clang enforces, that declaring a function via
a friend declaration is not sufficient for ADL to kick in. Add a namespace
level declaration so ADL works.
Clang complains about some error without it, I could not understand it, but
I'm not going to argue with it.
Since std::sort() will copy the comparator, it's better to pass using an
std::ref(), and everyone is happy.
Argument-dependent lookup will not find to_string() overloads in the global
namespace if the argument and the caller are in other namespaces.
Move these to_string() overloads to std:: so ADL will find them.
Found by clang.
The diagnostic that clang spits out when it sees an unrecognized warning
is itself a warning, so the test compilation succeeds and we don't notice
the warning is not supported.
Adding -Werror turns the warning about the unrecognized warning into an
error, allowing the detection machinery to work.
The blocked task detector introduced in
113ed9e963 was seeing
the initialization phase of perf_ssttable as a blocked
task.
Tranform this part of the code in a futurized loop
to make to blocked task detector happy.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170413132506.17806-1-benoit@scylladb.com>
This patch allows the check version to support private repositories.
If a repository file is passed as a parameter, the repository id will be
passed passed as a parameter when checking version.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"The version of create_index_statement class that was translated to C++
is pretty old by now. This series of cleanups brings it closer to Apache
Cassandra trunk to make it easier to bring over more secondary index
code to Scylla."
* 'penberg/create-index-stmt-cleanup/v1' of github.com:cloudius-systems/seastar-dev:
cql3/statements/create_index_statement: Move target validation
cql3/statements/create_index_statement: Remove static column validation
cql3/statements/create_index_statement: Extract validations
cql3/statements/create_index_statement: Kill bogus custom validation
cql3/statements/create_index_statement: Add materialized view to validate()
cql3/statements/create_index_statement: Remove validation
We take a reference of endpoint_state entry in endpoint_state_map. We
access it again after code which defers, the reference can be invalid
after the defer if someone deletes the entry during the defer.
Fix this by checking take the reference again after the defering code.
I also audited the code to remove unsafe reference to endpoint_state_map entry
as much as possible.
Fixes the following SIGSEGV:
Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout
0 --default-log-level info --'.
Program terminated with signal SIGSEGV, Segmentation fault.
(this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127
127 in /usr/include/c++/5/bits/stl_pair.h
[Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))]
Fixes#2271
Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>
This introduce offline installer generator.
It will generate self-extractable archive witch contains Scylla packages
and dependency packages.
Package installation automatically starts when the archive executed.
Limitation: Only supported CentOS at this point.
Fixes#2268
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491997091-15323-1-git-send-email-syuu@scylladb.com>
"This series was initially meant to only transition the keyspace based backend to work
on top of prepared statements but there were a few potential issues found on the way.
In addition the original Tracing series has been expanded with a few patches in the cql3 layer
that are improving the generic clq3 layer but are not obvious without the context of the
following Tracing patches.
The "main" patch contains a heavy rework of trace_keyspace_helper:
- Use prepared statements for updating tables instead of manually constructing mutations:
- We intentionally decrease the level of code robustness from "paranoid" to "normal".
- The code gets a lot more simple, e.g. we don't need to cache columns definitions any more.
- We are loosing some performance here but:
- Tracing write is not in the fast path.
- Tracing write events should be rare.
- Currently the performance loss (for the actual write time of all trace records) for a "SELECT" query with a specific key is
about 45%: 144us vs 99us."
* 'tracing_rework_using_prepared-v6' of github.com:cloudius-systems/seastar-dev:
tracing: use prepared statment for updating tables
tracing::trace_keyspace_helper: add a bad_column_family constructor that accepts an std::exception parameter
tracing::trace_keyspace_helper: introduce a table_helper class
tracing::trace_keyspace_helper: add static qualifier to make_monotonic_UUID_tp() and elapsed_to_micros() methods
tracing::tracing: allow slow query TTL only in the signed 32-bit integer range
cql3::query_processor::prepare(): futurize the error case
cql3::query_options: add a factory method for creation of options for a BATCH statement
cql3::statements::batch_statement: add a constructor that doesn't receive the "bound_terms" value
cql3::query_processor: use weak_ptr for passing the prepared statements around
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because expired cell is *unconditionally* converted into a
dead cell. Why not check if the expired cell can be purged instead using
gc before and max purgeable timestamp?
Currently, we need two compactions to get rid of a fully expired sstable
which cells could have always been purged.
look at this sstable with expired cell:
{
"partition" : {
"key" : [ "2" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 120,
"liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
"cells" : [
{ "name" : "country", "value" : "1" },
]
now this sstable data after first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.
{
...
"rows" : [
{
"type" : "row",
"position" : 79,
"cells" : [
{ "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
"tstamp" : "2017-04-09T17:07:12.702597Z"
},
]
now another compaction will actually get rid of data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0
NOTE:
It's a waste of time to wait for second compaction because the expired
cell could have been purged at first compaction because it satisfied
gc_before and max purgeable timestamp.
Fixes#2249, #2253
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
In addition to actually moving to using the prepared statements the changes also include:
- Kill the cache_xxx() methods - the schema is going to be checked during the
prepared statement creation and during its execution.
- Move the caching of table ID and the prepared statement to the get_schema_ptr_or_create().
- Rename: get_schema_ptr_or_create() -> cache_table_info().
After these changes we are less strict in our demands to system_traces tables schemas, e.g.
if some column's type is not exactly as we expect but rather only "compatible" in the CQL sense
we will tolerate this and will continue to write into that table.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
An object built with this constructor will use the what() message from the
given exception in the final error message.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This class contains a general table info and implements standard operations on this table:
- Creation.
- Info caching.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Extract specific validations to separate functions to preserve the same
structure as Apache Cassandra code and make it easier to add support for
multiple index targets.
Rejecting custom indices is bogus because it's just a configuration
mechanism like replication strategy, for example. Furthermore, it's
needed for SASI indices, which we likely need to be compatible with.
The validation was removed in Apacha Cassandra commit 0626be8 ("New 2i
API and implementations for built in indexes"). Let's also remove it
from our code so that we remove one dependency to the obsolete db/index/
code.
Any TTL is eventually converted into the gc_clock::duration value, which
is based on int32_t type. Limit the node_slow_log TTL user configurable value
to the same values range for consistency.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Make sure that errors are reported in a form of an exceptional future and
not by a direct exception throwing.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This constructor should be used when we know that there are no bound terms in the current
batch statement.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Use seastar::checked_ptr<weak_ptr<pepared_statement>> instead of shared_ptr for passing prepared statements around.
This allows an easy tracking and handling of statements invalidation.
This implementation will throw an exception every time an invalidated
statement reference is dereferenced.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
segment_zone::migrate_all_segments() was trying to migrate all segments
inside a zone to the other one hoping that the original one could be
completely freed. This was an attempt to optimise for throughput.
However, this may unnecesairly hurt latency if the zone is large, but
only few segments are required to satisfy reclaimer's demands.
Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>
While cpuset.conf is supposed to be set before this is used, not having
a cpuset.conf at all is a valid configuration. The current code will
raise an exception in this case, but it shouldn't.
Also, as noted by Amos, atoi() is not available as a global symbol. Most
invocations were safe, calling string.atoi(), but one of them wasn't.
This patch replaces all usages of atoi() with int(), which is more
portable anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170407172937.17562-1-glauber@scylladb.com>
We have recently fixed the ami init scripts to mark i3 as a supported
instance. However, the code to detect whether or not the instance is
supported is duplicated, and called from multiple locations. That means
that when the user logs in, it will see the instance as not supported -
as the test is coming from a different source.
This patch moves it to the scylla_lib.sh utilities script, so we can
share it, and make sure it is right for all locations.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406201137.8921-1-glauber@scylladb.com>
Older versions of packer do not support ENA, and when faced
with the option
"enhanced_networking": true,
will only actually enable it for the older 82599 VF instances.
Fortunately, packer 1.0 already supports it, and all we have to do is
update it.
While we are at it, let's check if the file is legit before using a
random file we have downloaded from the internet, to avoid breaching our
building process.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406123952.14708-1-glauber@scylladb.com>
For the debian package, the files don't have to be listed individually,
but for RPMs they do.
The rpm builds are currently failing with:
error: Installed (but unpackaged) file(s) found:
/usr/lib/scylla/scylla_util.py
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406012336.21081-1-glauber@scylladb.com>
Following C* API there are two APIs for getting the load from
storage_service:
/storage_service/metrics/load
/storage_service/load
This patch adds the implementation for
/storage_service/metrics/load
The alternative, is to drop on of the API and modify the JMX
implementation to use the same API.
Fixes#2245
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170401181520.19506-1-amnon@scylladb.com>
Currently we return proceed::no after every mutation_fragment which is
to be consumed. This froces parser to save and reload its state
often. This can be avoided if we pushed the fragments directly from
mp_row_consumer, then we would return proceed::no only when the buffer
fills up.
tests/perf/perf_fast_forward shows 15% increase in throughput of a large partition scan,
from 1.34M frag/s to 1.55M frag/s.
Message-Id: <1490882700-22684-1-git-send-email-tgrabiec@scylladb.com>
rpmbuild tries to compile any *.py by default, but it causes compilation
error on python3 code when python2 is system default (CentOS, RHEL).
So skip compiling it, drop .pyc / .pyo from the package.
Fixes#2235
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1490859768-18900-1-git-send-email-syuu@scylladb.com>
"Also, this new version supports i3.
Number of requests for i3 is obtained similarly as to i2: I have run tests
for a single disk, and then we'll take the amount of disks into account. Other
possible limits are also taken into account, like the max per-shard seastar limit
of 128 in-flight request, and the per-disk limit obtained by sysfs."
* 'python-io-setup' of https://github.com/glommer/scylla:
rewrite scylla_io_setup in python
scripts: add python module with common utilities
As we convert more stuff to python, we'll have more opportunities for
sharing code between them. We already do that for the bash scripts with
a file "scylla_lib.sh". We'll do the same for python.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We recently introduced fstrim cronjob / systemd timer unit, but some of
distributions already has their own fstrim cronjob / systemd timer unit.
So let's use them when it's posssible.
Fixes#2233
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491226727-10507-1-git-send-email-syuu@scylladb.com>
After RAID devices mounted to /var/lib/scylla, scylla-server doesn't able to
find ./conf/ directory since we haven't created symlinks on the volume.
Instead of creating symlink on scylla_raid_setup, let's specify scylla.yaml
path on program argument.
Fixes#2236
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491056110-1078-1-git-send-email-syuu@scylladb.com>
Running nodetool status on each login to Scylla AMI helps in three ways:
- give the user a quick view of the node and cluster status beyond the current "scylla is active"
- hint to the user about the nodetool and how to use it
- move the first, slow, run of nodetool to the login phase, making the second interactive run much faster
on the down side, it does slow the login in a few sec
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
[ penberg: fix formatting ]
Message-Id: <20170330081711.22038-1-tzach@scylladb.com>
This script validates schemas of distributed system keyspaces:
system_traces and system_auth.
It tries to add the missing columns and checks that the existing columns
have expected types.
In case of any problem the corresponding info message is printed and non-zero exit
status is returned.
Extra columns are ignored.
The validation function may also be called from another Python program.
It returns True in case of success and False otherwise.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490195645-29905-1-git-send-email-vladz@scylladb.com>
"sstable_streamed_mutation::fast_forward_to() is changed to use promoted index
(via index_reader) to optimize skipping in large partitions.
In addition to that, sstable mutation_reader is changed to use the index
to skip to the next partition.
Performance impact was evaluated using newly added tests/perf/perf_fast_forward
What's beyond this series:
- Using index_reader for single-partition reads as well
- Using index_reader for skipping across ranges in clustering restrictions"
* tag 'tgrabiec/skip-within-partition-using-index-v2' of github.com:cloudius-systems/seastar-dev: (47 commits)
tests: Add performance test for fast forwarding of sstable readers
tests: Allow starting cql_test_env on pre-existing data
config: Allow specifying source when setting value
tests: sstable: Add test for fast forwarding within partition using index
sstables: sstable_streamed_mutation: use index in fast_forward_to()
sstables: Store parsed promoted index in index_entry
sstables: Add trace-level logging for sstable consumption
sstables: Define deletion_time earlier
sstables: Make parsing throw exception on malformed promoted index
tests: Add tests for ordering of position_in_partition relative to composites
position_range: Introduce all_clustered_rows() factory method
position_in_partition: Introduce for_key()/after_key() factory methods
position_in_partition: Add factory methods for positions around all rows
position_in_partition: Introduce for_range_start()/for_range_end()
position_in_partition: Fix friendship declaration
keys: Introduce is_empty() for prefixes
position_in_partition: Make comparable with composites
types: Enhance lexicographical comparators
compound_compat: Accept marker value in serialize_value()
compound_compat: Add trichotomic comparator
...
quick introduction to level starvation:
high levels may be left uncompacted (thus starved) for a long time if user
makes something that make they contain little data, such as cleanup or change
of max sstable size (default 160M). Leveled strategy handles this problem as
follow: consider we're compacting L1 to L2. If L3 is starved, we look for one
of its sstable that is fully contained in token range of candidates L1->L2,
so that we won't end up with an overlapping in L2.
now the problem:
the functionality isn't working properly now because range of candidates is
being incorrectly calculated due to an accident when converting the code to
C++. It won't cause an overlap because it's actually being more restrictive
about which sstable from starved level can be used.
A test case was added to confirm the problem.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170328223753.15398-1-raphaelsc@scylladb.com>
So that is_set() will be true for that option. Needed in tests which
set some config options in higher layer and then lower layers detects
if option was set or not before applying its default.
Will be easier to propagate failure to upper layers once parsing is
reused in the index_reader.
The old behavior of ignoring parsing failures is preserved, but the
error is logged now.
They now accept optional lexicographical_relation which can be used
to alter position of the element relative to elements prefixed by it.
Example. Let's consider lexicographical ordering on strings. The
position of "bc" in a sample sequence is affected by
lexicographical_realtion as follows:
aa
aaa
b
ba
--> before_all_prefixed
bc
--> before_all_strictly_prefixed
bca
bcd
--> after_all_prefixed
bd
bda
c
ca
The iterator doesn't really allow modifying unserlying component.
This change enables using the iterator with boost::make_iterator_range() and
boost::range::join(), which get utterly confused otherwise.
Direct motivation for this is to be able to use two index readers from
a single mutation reader, one for lower bound of the range and one for
the upper bound of the range, without sacrificing optimization of
avoiding index reads when forwarding to partition ranges which are
close by. After the change, all index readers of given sstable will
share index buffers, so lower bound reader can reuse the page read by
the upper bound reader.
The reason for using two readers will be so that we are able to skip
inside the partition range, not only outside of it. This is not
possible if we use the same index reader to locate the upper bound of
the range, because we may only advance the cursor.
Needed to make the index_list copyable, which is going to be needed to
implement legacy get_index_entries() which returns by value, after
index sharing is implemented.
Index reader already can be queried only with monotonic positions, so
the concept of a cursor is ingrained. Making it explicit will make it easier
to define behavior for forwarding withing the partition.
After the change:
- lower_bound() is renamed to advance_to() and doesn't return
the position, only advances the cursor
- data file position for partition under cursor can be obtained
at any time with data_file_position()
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as sstable at all, we must tag it as zero-rp, and
make whole shard for it start at same zero.
This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.
Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.
I.e. three fixes:
* Reinstate min position guarding for unencoutered CF:s
* Fix stream creating in CL reader
* Fix segment map iterator use.
v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>
This patch ensures we generate UUIDs using the same randomness source
as all the other values we randomly generator, so that we can get a
deterministic run from the seeds we print.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170327161149.8938-2-duarte@scylladb.com>
* seastar 6b21197...2ebe842 (6):
> Merge "Various improvements to execution stages" from Paweł
> app-template: allow apps to specify a name for help message
> bool_class: avoid initializing object of incomplete type
> app-template: make sure we can still get help with required options
> prometheus: Http handler that returns prometheus 0.4 protobuf or text format
> Update DPDK to 17.02
Includes patch from Pawel to adjust to updated execution_stage interface.
Users sometimes need to run their own yaml configuration files, and it
is currently annoying to deploy modified files on docker.
One possible solution is to bind mount the file into the docker
container using the -v switch, just like we already do for for the data
volume.
The problem with the aforementioned approach is that we have to change
the yaml file to insert the addresses, and that will change the file in
the host (or fail to happen, if we bind mount it read-only).
The solution I am proposing is to avoid touching the yaml file inside
the container altogether. Instead, we can deploy the address-related
arguments that we currently write to the yaml file as Scylla options.
Fixes#2113
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1490195141-19940-1-git-send-email-glauber@scylladb.com>
If early startup fails in docker-entrypoint.py, the container does not
start. It's therefore not very helpful to log to a file _within_ the
container...
Message-Id: <1490275943-23590-1-git-send-email-penberg@scylladb.com>
Fixes#2173
Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.
Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.
Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
Don't report a Tracing session ID unless the current query had a Tracing bit in its
flags.
Although the current master's behaviour is legal it's suboptimal and some Clients are sensitive to that.
Let's fix that.
Fixes#2179
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490063752-8915-1-git-send-email-vladz@scylladb.com>
column_family constructor uses delegation, as such, only the actual
constructor implementation should contain a call to register the
metrics.
Current implementation ends up with re registration of the metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170320140817.22214-1-amnon@scylladb.com>
This patch exposes Scylla's Prometheus port by default. You can now use
the Scylla Monitoring project with the Docker image:
https://github.com/scylladb/scylla-grafana-monitoring
To configure the IP addresses, use the 'docker inspect' command to
determine Scylla's IP address (assuming your running container is called
'some-scylla'):
docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla
and then use that IP address in the prometheus/scylla_servers.yml
configuration file.
Fixes#1827
Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>
On Ubuntu 14.04, the lsblk doesn't have '-p' option, but
`scylla_setup` try to get block list by `lsblk -pnr` and
trigger error.
Current simple pattern will match all help content, it might
match wrong options.
scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p
-m, --perms output info about permissions
-P, --pairs use key="value" output format
Let's use strict pattern to only match option at the head. Example:
scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D'
-D, --discard print discard capabilities
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.
Fixes#192.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
The mapping between a base table update and a view update is schema
dependent, so we need to ensure the view schema versions match the
base schema version. For example, we match base columns to view
columns by name, so we need to ensure the base and view schemas we're
using for writting are isolated with respect to a previous alter
table statement.
We thus need to match base schema versions with view schema versions,
and we need to so atomically to ensure that when one fiber sees a
schema, it also sees the complete set of corresponding view schemas.
This series ensures the schemas modified as a result of an alter
table statement are published atomically, under the schema lock. This
way, all the schemas referenced by the database are consistent with
each other when they are observed by other fibers.
Finally, we upgrade the mutation schema before generating the view
updates, to ensure it matches the most recent view schemas the base
replica knows about, registered in the database.
The db::view::view class was replaced by a set of non-member
functions, with its state, which used to reflect only the most recent
schema version, being moved to a new view_info class.
One of the goals of can_allocate_more_memory() is to prevent depleting
seastar's free memory close to its minimum, leaving a head room above
that minimum so that standard allocations will not cause reclamation
immediately. Currently the function doesn't take into accoutn actual
threshold used by the seastar allocator, so there could be no gap or
even could go below the minimum.
Fix that by ensuring there's always a gap above min_free_memory().
min_gap was reduced to 1 MiB so that low memory setups are not
impacted significantly by the change.
Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>
* seastar 4d25b85...6b21197 (3):
> core: memory: Expose control of the free memory low water mark
> scripts: add perftune.py
> tutorial: make network examples work on multi-core
"The test allocates objects in batches (allocation is always under a reclaim
lock) of ~3MiB and assumes that it will always succeed because if we cross the
low water mark for free memory (20MiB) in seastar, reclamation will be
performed between the batches, asynchronously.
Unfortunately that's prevented by can_allocate_more_memory(), which fails
segment allocation when we're below the low water mark. LSA currently doesn't
allow allocating below the low water mark.
The solution which is employed across the code base is to use allocating_section,
so use it here as well.
Exposed by recent consistent failures on branch-1.7."
* 'tgrabiec/fix-lsa-async-eviction-test' of github.com:cloudius-systems/seastar-dev:
tests: lsa_async_eviction_test: Allocate objects under allocating section
lsa: Allow adjusting reserves in allocating_section
This patch ensures we upgrade the mutation to the current schema when
generating and pushing view updates, so that the it matches the most
up to date views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures that the schema merging atomically publishes
schema changes. In particular, it ensures that when a base schema
and a subset of its views are modified together (i.e., upon an alter
table or alter type statement), then they are published together as
well, without any deferring in-between.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the migration path for table updates such that the
base table mutations are sent and applied atomically with the view
schema mutations.
This ensures that after schema merging, we have a consistent mapping
of base table versions to view table versions, which will be used in
later patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The write path uses a base schema at a particular version, and we
want it to use the materialized views at the corresponding version.
To achieve this, we need to map the state currently in db::view::view
to a particular schema version, which this patch does by introducing
the view_info class to hold the state previously in db::view::view,
and by having a view schema directly point to it.
The changes in the patch are thus:
1) Introduce view_info to hold the extra view state;
2) Point to the view_info from the schema;
3) Make the functions in the now stateless db::view::view non-member;
4) Remove the db::view::view class.
All changes are structural and don't affect current behavior.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In preparation for upcoming patches, which will deal with
moving the state in db::view::view to view_info.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"The current implementations of collection_type_impl::is_empty() and
collection_type_impl::difference() don't handle tombstoned collection
mutations correctly. In particular:
- is_empty() considers a collection mutation with a tombstone (and no
entries) as empty;
- difference() doesn't do set difference between the cells tombstones,
and always returns the highests.
Fixes#2152"
* 'collection-diff/v4' of github.com:duarten/scylla:
mutation_test: Add more test cases for difference()
mutation_source_test: Randomly generate collection cells
collection_type_impl: Use set difference for tombstones
collection_type_impl: A mutation with a tombstone is not empty
This patch fixes collection_type_impl::difference() so it does set
difference for tombstones instead of just returning the larger
one, as difference() is supposed to return only the information in
mutation A that supersedes that in B, given difference(A, B).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the collection_type_impl::is_empty() function so
that it doesn't consider empty a collection_mutation which has a
tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Discarding blocks on large RAID volume takes too much time, user may suspects
the script doesn't works correctly, so it's better to skip, do discard directly on each volume instead.
Fixes#1896
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1489533460-30127-1-git-send-email-syuu@scylladb.com>
Fixes#2098
Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.
v2:
* Fixed whitespace errors
Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
Metrics name should be unique per type.
requests_blocked_memory was registered twice, one as a gauge and one as
derived.
This is not allowed.
Fixes#2165
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314162826.25521-1-amnon@scylladb.com>
* seastar fd29fd0...4d25b85 (2):
> core/file: fix EOF detection for file with custom impl
> tutorial: fix echo server example
Includes patch from Raphael updating checked_file_impl:
"Now file_impl requires dma_read_bulk to be implemented, and for
checked_file_impl, it only's about calling dma_read_bulk from
the posix file it wraps."
Metrics should have their unique name. This patch changes
throttled_writes of the queu lenght to current_throttled_writes.
Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.
This could be related to scylladb/seastar#250
Fixes#2163.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
* seastar 84a0b70...fd29fd0 (4):
> Fix smp::submit_to() with function reference
> execution_stage: add concept restraint for operator()
> core/temporary_buffer: Add operator==()
> map_reduce: allow reducer to take accumulated value by rref
"This replaces use of a generic forwarding wrapper in sstable reader with
specialized implentation. Forwarding doesn't yet utilize indexes in this
series, only integrates it with mp_row_consumer, which is a prerequisite.
It's still an optimization, since mp_row_consumer will not try to consume
past the range as it used to.
Sending early for easier consumption."
* tag 'tgrabiec/forwarding-of-mp-row-consumer-v2' of github.com:scylladb/seastar-dev:
sstables: Remove use of forwarding wrapper
sstables: Implement sstable_streamed_mutation::fast_forward_to()
sstables: Extract and use clustering_ranges_walker
tests: sstables: Add test for handling of repeated tombstones
sstables: Extract writer parameters into config objects
tests: Move as_mutation_source() helper to header
tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions
streamed_mutation: Add streamed_mutation_returning() helper
tests: mutation_source_test: Add test case for forwarding to a full range
tests: simple_schema: Add fragment factories
tests: Extract simple_schema
sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
sstables: Drop default mp_row_consumer constructor
sstables: Swap order of values in "proceed" so that "no" is assigned 0
util/optimized_optional: Make printable
position_in_partition: Add is_static_row() in the view
range_tombstone_stream: Add reset()
range_tombstone_stream: Add get_next(position_in_partition_view)
sstables: streamed_mutation: Stop reading when end of slice reached
sstables: Switch is_in_range() to position_in_partition
Handling of forwarding is done inside mp_row_consumer, because it
allows us to filter out irrelevant data sooner and thus more
efficiently.
Becuase static row can be now skipped as well, _skip_clustering_row
was renamed to more generic _skip_in_progress.
This is a preliminary step before adding support for fast-forwarding
to mp_row_consumer, so that range handling can be solely in
mp_row_consumer rather than split between it and
sstable_streamed_mutation.
This also alleviates #2080 by reading all tombstones only up to the
first row, after that range tombstones are treated like other
fragments.
As part of this change, skip detection detection is refactored. This
simplifies reasoning about mp_row_consumer's state a bit because now
is_mutation() is not reset externally and only depends on current
position of the reader.
It will prove useful when we extend mutation reader to decide if it
should skip to the next partition up front before calling
_context.read(), so that we can for instance skip using index instead.
Fixes#2088.
This patch changes a lambda argument type so the keyspace name is
passed by reference instead of copying it, in
read_schema_for_keyspaces().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213134.10331-1-duarte@scylladb.com>
Change linkage of segment_descriptor_hist_options to external to keep
good old GCC5 happy, despite C++11 allowing static linkage of non-type
template arguments.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213206.10383-1-duarte@scylladb.com>
Add batch_size_fail_threshold_in_kb to prevent huge batch from been
applied and causing troubles. Also do not warn or fail if only one
partition is affected.
Fixes: #2128
Message-Id: <20170309111247.GE8197@scylladb.com>
"These patches introduce execution stages to Scylla in order to improve
icache friendliness. The places were stages are added are not chosen
very carefully and rather introduced at between different subsystems:
cql, storage proxy and database. This already results in a rather
significant improvement and can be tuned later if necessary.
Performance results:
perf_simple_query -c4 --duration 60
(medians)
before after diff
write 83017.75 242876.04 +192.6%
read 61709.16 168258.26 +172.7%
The real life improvements aren't as good because it is much harder
to collect sufficiently high number of operations in a batch."
Additional benchmarking from Paweł:
"I did some tests on my local setup.
* Latency at light loads
Scylla running on 16 logical CPUs (8 cores) with 64 GB of RAM.
cassandra-stress -rate threads=32
write latency
master seda
median 1.2 0.6
95th 1.6 0.8
99th 1.7 0.9
99.9th 2.5 1.3
max 26.4 24.2
Flags '--poll-mode' and '--defragment-memory-on-idle false' didn't improve situation for master.
See also attached graph write_99.svg and write_999.svg.
read latency
master seda
median 0.8 0.6
95th 1.0 0.9
99th 1.1 1.0
99.9th 1.4 1.2
max 18.5 18.0
See also attached graph read_99.svg and read_999.svg.
* Server 100% loaded, dataset fitting in memory (throughput)
Scylla running on 2 cores with 64 GB of RAM.
4x scylla-bench with the uniform workload
(concurrency of each s-b: 512 for writes, 256 for reads).
There were no cache misses during reads.
master seda diff
writes 107722.4 168482.26 +56.4%
reads 51049.48 76158.19 +49.2%
* Server 100% loaded, writes being flushed and compacted (throughput)
Scylla running on 2 cores with 4 GB of RAM.
4x scylla-bench with the uniform workload, concurrency 256 each.
master seda diff
writes 79575.77 114206.11 +43.5%
See attached graph: writes_with_flushes_and_compaction.png (first run: master, second: seda)."
* tag 'pdziepak/scylla-execution-stages/v1-rebased' of github.com:cloudius-systems/seastar-dev:
transport: make process_request_one() an execution stage
mutation_query: add an execution stage
db: make database::query() an execution stage
db: make apply an execution stage
storage_proxy: make mutate() an execution stage
cql3: make batch statement an execution stage
cql3: make modification statement an execution stage
cql3: make select statement an execution stage
mutation_reader: make mutation_source nothrow movable
There is no need to wait when starting the prometheus server. As it is
up to each of the modules to register its metrics when it is ready.
This is especially important when debuging boot issues.
This patch moves the prometheus initilization to be done at an early
stage of the boot sequencec.
Fixes#2144
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1489041986-28974-1-git-send-email-amnon@scylladb.com>
We have:
auto halves = range.split(midpoint, dht::token_comparator());
We saw a case where midpoint == range.start, as a result, range.split
will assert becasue the range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range - hence the
assertion failure.
Fixes#2148
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
* seastar 5861f99...84a0b70 (13):
> build: don't error out on [[deprecated]] APIs
> Merge "Introduce execution stages" from Paweł
> Remove unused include statement
> http: catch and count errors in read and respond
> Merge "Adding metrics configuration" from Amnon
> future: add concepts for map_reduce(), when_all_succeed()
> doxygen: exclude c-ares directory
> scripts/posix_net_conf.sh: add --use-cpu-mask option
> file: take flush into account when calculating size for truncate in optimize_queue()
> Fixing the prometheus cleanup patch
> Merge "posix_net_conf.sh: better distribute ingress processing" from Vlad
> prometheus: code clean up
> future: relax finally() constraints even more
"If a node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.
To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.
Fixes #2129."
* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
db: Create default auth and tracing keyspaces using lowest timestamp
migration_manager: Append actual keyspace mutations with schema notifications
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.
Fixes#2143
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
If the node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.
To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.
Fixes#2129.
There is a workaround for notification race, which attaches keyspace
mutations to other schema changes in case the target node missed the
keyspace creation. Currently that generated keyspace mutations on the
spot instead of using the ones stored in schema tables. Those
mutations would have current timestamp, as if the keyspace has been
just modified. This is problematic because this may generate an
overwrite of keyspace parameters with newer timestamp but with stale
values, if the node is not up to date with keyspace metadata.
That's especially the case when booting up a node without enabling
auto_bootstrap. In such case the node will not wait for schema sync
before creating auth tables. Such table creation will attach
potentially out of date mutations for keyspace metadata, which may
overwrite changes made to keyspace paramteters made earlier in the
cluster.
Refs #2129.
The current code is assymetric: the first N-1 shards to delete a set receive
a synthetic future to wait on, while the last deletion receives the result
of the delete operation (which also broadcasts completion to the first N-1
operations. This results, in case of an error, with the Nth future being
reported as an unhandled error.
Fix by making everything symmetric: all N callers receive a synthetic
future. Nobody waits for the deletion operation (which still broadcasts its
completion to all waiters, so errors are not lost).
Message-Id: <20170305151607.14264-1-avi@scylladb.com>
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.
Performance was tested using:
perf-simple-query -c4 --counters --duration 60
The following results are medians.
before after diff
write 18640.41 33156.81 +77.9%
read 58002.32 62733.93 +8.2%"
* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
cell_locker: add metrics for lock acquisition
storage_proxy: count counter updates for which the node was a leader
storage_proxy: use counter-specific timeout for writes
storage_proxy: transform counter timeouts to mutation_write_timeout_exception
db: avoid allocations in do_apply_counter_update()
tests/counters: add test for apply reversability
counters: attempt to apply in place
atomic_cell: add COUNTER_IN_PLACE_REVERT flag
counters: add equality operators
counters: implement decrement operators for shard_iterator
counters: allow using both views and mutable_views
atomic_cell: introduce atomic_cell_mutable_view
managed_bytes: add cast to mutable_view
bytes: add bytes_mutable_view
utils: introduce mutable_view
db: add more tracing events for counter writes
db: propagate tracing state for counter writes
tests/cell_locker: add test for timing out lock acquisition
counter_cell_locker: allow setting timeouts
db: propagate timeout for counter writes
...
logalloc::reclaim_lock prevents reclaim from running which may cause
regular allocation to fail although there is enough of free memory.
To solve that there is an allocation_section which acquire reclaim_lock
and if allocation fails it run reclaimer outside of a lock and retries
the allocation. The patch make use of allocation_section instead of
direct use of reclaim_lock in memtable code.
Fixes#2138.
Message-Id: <20170306160050.GC5902@scylladb.com>
"Work on this series started with fixing the 'nodetool clearsnapshot'.
The current master code ignores the snapshots in deleted keyspaces (issue #2045).
I noticed that in many places our code has to build the path to some directory/file
it simply had the sstring(<path1>) + "/" + sstring(<path2>) constructs which may cause us issues
if somebody decides to complile/run scylla on not-Unix-based OS, like Microsoft Windows.
I understand that this is a long shot but if we can make it right now - why not to.
The answer is boost::filesystem::path class - its synchronous parts, of course.
I decided to take an initiative and fix the issues above and then use the fixed code for
fixing the issue #2045:
- Fix some minor issues in the existing code.
- Extend the lister class and move it into the separate files outside database.cc.
On the way I've found an issue in the existing code (issue #2071).
This series fixes this one too (PATCH2)."
The current test for whether or not the filesystem is mounted is weak
and will fail if multiple pieces of the hierarchy are mounted.
util-linux ships with a mountpoint command that does exactly that,
so we'll use that instead.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488742801-4907-1-git-send-email-glauber@scylladb.com>
boost::split() return one empty string if called on an empty input.
Trying to cast an empty string to a token value results in a bad_lexical_cast
exception. Fix it by handling empty token list explicitly.
Message-Id: <20170302125405.GU11471@scylladb.com>
From Paweł:
These patches optimise commitlog_entry_writer so that it avoids copying
column mapping, which is a particularly expensive operation.
perf_simple_query -c4 --write --duration 60
(medians)
before after diff
write 79434.35 89247.54 +12.3%
Tested with:
commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup
commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table
commitlog_test.py:TestCommitLog.test_commitlog_replay_with_counters
The general algorithm for merging counter cells involves allocating a
new buffer for the shards. However, it is expected that most of the
applies are just updating the values of existing shards and not adding
new ones, therefore can be done in place.
However, reverting the general and in-place applies requires different
logic, hence the need for an additional flag to differentiate between
them.
Encountering tombstones while transforming counter update from deltas to
shards is expected to be rare due to the fact that counter cells cannot
be recreated once removed.
This assumption makes it unnecessary to care much about removed cells
during delta->shard transformation as it adds complexity to the code and
is not required to produce correct results.
This patch attempts to avoid excessive allocations and copies when
constructing counter cells using counter_cell_builder. That involves
adding serializer interface to atomic_cell so that the counter cell can
be directly serialized to the buffer allocated for atomic cell.
counter_cell_builder::from_single_shard() is added as well to avoid
std::vector<> overhead when creating a counter cell from a single shard.
By default behavior is kept the same. There are deployments in which we
would like to mount data and commitlog to different places - as much as
we have avoided this up until this moment.
One example is EC2, where users may want to have the commitlog mounted
in the SSD drives for faster writes but keep the data in larger, less
expensive and durable EBS volumes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488258215-2592-1-git-send-email-glauber@scylladb.com>
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.
I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.
Fixes the following UBSAN report:
/usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'
Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
In the later stages of counter write path a mutation is produced that
already has all cells transformed to counter shards and can be applied
to the memtable and written to the commitlog.
The current interface expectes a frozen mutation, which is suboptimal
for counters. The freeze itself is unaviodable -- it is required by
commitlog, but we can avoid later deserialization of frozen_mutation
when it is applied to the memtable if we pass the unfrozen mutation
along.
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.
Standard mutation queries can are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.
Since gcc-5/stretch=5.4.1-2 removed from apt repository, we nolonger able to
build gcc-5.
To avoid dead link, use launchpad.net archives instead of using apt-get source.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>
Fixes the following UBSAN warning:
core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment
Since the field was not initialied properly, probably also fixes some
user-visible bug.
Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>
This patch replaces the current heap with a logarithmic histogram
to hold the closed segment descriptors.
This histogram stores elements in different buckets according to
their size. Values are mapped to a sequence of power-of-two ranges
that are split in N sub-buckets. Values less than a minimum value
are placed in bucket 0, whereas values bigger than a maximum value
are not admitted.
There is some loss of precision as segments are now not totally
ordered, and precision decreases the more sparse a segment is. This
allows to reduce the cost of the computations needed when freeing
from a closed segment.
Performance results for perf_simple_query -c4 --duration 60
before after diff
read 43954.27 45246.10 +2.9%
write 48911.54 52807.76 +7.9%
Fixes#1442
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235328.27937-1-duarte@scylladb.com>
* seastar 4d4a58d...5861f99 (9):
> future: adjust finally constraint to allow any future to be returned from the continuation
> build: allow specifying the C compiler
> socket: Change signature (and impls) of socket shutdown to void
> reactor: give names to OS threads
> Concepts support
> core/file: Fix short-read in read_maybe_eof()
> core/fstream: Avoid issuing read requests beyond _remain
> tests: Improve assertion failure message
> reactor: Expose IO stats in a public API
It is safe to copy column_mapping accros shards. Such guarantee comes at
the cost of performance.
This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.
Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
before after diff
write 79434.35 89247.54 +12.3%
"This introduces an API which allows forward navigation in a stream of mutation
fragments. It allows one to consume only a subset of the stream by iteratively
specifying sub-ranges from which fragments should be returned.
API outline:
When in forwarding mode, the stream does not return all fragments right away,
but only those belonging to the current range. Initially current range only
covers the static row. The stream can be forwarded, even before reaching end-
of-stream for current range, to a later range with fast_forward_to().
Forwarding doesn't change initial restrictions of the stream, it can only be
used to skip over data.
Monotonicity of positions is preserved by forwarding. That is fragments
emitted after forwarding will have greater positions than any fragments
emitted before forwarding.
For any range, all range tombstones relevant for that range which are present
in the original stream will be emitted. Range tombstones emitted before
forwarding which overlap with the new range are not necessarily re-emitted.
When not in forwarding mode, the stream acts as if the current range was equal
to the full range. This implies that fast_forward_to() cannot be
used.
Whether stream is in forwarding mode or not is specified when the stream
is created, typically via mutation_source interface.
What's left for later series:
Optimization by providing specialized implementations. This series implements
forwarding support in all mutation sources via generic wrapper which simply
drops fragments."
* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev:
tests: mutation_source_tests: Verify monotonicty of positions
tests: random_mutation_generator: Spread the keys more
tests: mutation_source_test: Make blobs more easily distinguishable
tests: streamed_mutation: Test that merged stream passes mutation source tests
tests: mutation_source_test: Add tests for forwarding of streamed_mutation
tests: streamed_mutation_assertions: Add methods for navigating the stream
tests: Add range generators to random_mutation_generator
partition_slice_builder: Add with_ranges()
query: Introduce full_clustering_range
streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()
db: Enable creating forwardable readers via mutation_source
mutation_source: Document liveness requirements
mutation_source: Cleanup
db: Replace virtual_reader_type with mutation_source_opt
partition_version: Refactor make_partition_snapshot_reader() overloads
database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
memtable: Accept all mutation_source parameters
streamed_mutation: Implement fast_forward_to() in stream merger
streamed_mutation: Add generic implementation of forwardable streamed_mutation
streamed_mutation: Add fast_forward_to() API
position_in_partition: Introduce position_range
position_in_partition: Introduce position constructor for right after the static row
streamed_mutation: Make cast to view non-explicit
streamed_mutation: Make schema() getter non-copying
So that streamed_mutation is created in only one of the overloads and
others delegate to that one. Later there will be common logic added to
the construction and doing this will help avoid a duplication.
It was using the state passed via as_mutation_source() instead. Let's
respect mutation_source contract instead, and use the state passed via
mutation_source invocation.
Technically just a cleanup. Alse prerequisite for more cleanup.
Do not ignore a future<> retuned by cycle() since it will produce a
warning in case of an error. Log it instead.
Message-Id: <20170219151811.GN11471@scylladb.com>
Set murmur3_partitioner_ignore_msb_bits to 12 (enabling the new sharding
algorithm), but do this in scylla.yaml rather than the built-in defaults.
This avoids changing the configuration for existing clusters, as their
scylla.yaml file will not be updated during the upgrade.
Message-Id: <20170214123253.3933-1-avi@scylladb.com>
Adds yet another magic function "SCYLLA_COUNTER_SHARD_LIST", indicating that
argument value, which must be a list of tuples <int, UUID, long, long>,
should be inserted as an actual counter value, not update.
This of course to allow counters to be read from sstable loader.
Note that we also need to allow timestamps for counter mutations,
as well as convince the counter code itself to treat the data as
already baked. So ugly wormhole galore.
v2:
* Changed flag names
* More explicit wormholing, bypassing normal counter path, to
avoid read-before-write etc
* throw exceptions on unhandled shard types in marshalling
v3:
* Added counter id ordering check
* Added batch statement check for mixing normal and raw counter updates
Message-Id: <1487683665-23426-2-git-send-email-calle@scylladb.com>
* seastar 5088065...4d4a58d (3):
> reactor utilization should return the utilization in 0-1 range
> collectd should ignore type label in name creation
> fix append_challenged_posix_file_impl::process_queue() to handle recursion
On database stop, we do flush memtables and clean up commit log segment usage.
However, since we never actually destroy the distributed<database>, we
don't actually free the commitlog either, and thus never clear out
the remaining (clean) segments. Thus we leave perfectly clean segments
on disk.
This just adds a "release" method to commitlog, and calls it from
database::stop, after flushing CF:s.
Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>
Failing to close a file properly before destroying file's object causes
crashes.
[tgrabiec: fixed typo]
Message-Id: <20170221144858.GG11471@scylladb.com>
5a0955e89d "db: add operations for
applying counter updates" merged two column_family::apply() overloads
into do_apply() in order to reduce code duplication. Unfortunately,
a call to check_valid_rp() didn't survive that change.
Message-Id: <20170221133800.30411-1-pdziepak@scylladb.com>
When loading a schema asynchronously, we're leaving a strong
reference to the loaded schema in the entry's shared future. This
patch fixed this by storing a shared_promised, which is reset when the
schema is loaded.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170220193654.17439-1-duarte@scylladb.com>
This reverts commit 1e2c01ff49.
We do not detect repeated tombstone if it follows an in-range
tombstone following a skipped clustering row, because _in_progress
will be disengaged after such tombstone is emitted.
Message-Id: <1487596080-21480-1-git-send-email-tgrabiec@scylladb.com>
Move lister class away from database.cc.
This is a preparation for moving it to the seastar library.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The current implementation of 'nodetool clearsnapshot' command only deletes the snapshots
of the keyspaces that are alive at the time the command is issued (issue #2045).
This, besides not implementing the spec, prevents users from being able to clear
the disk space occupied by snapshots of deleted keyspaces that are no longer needed
(e.g. snapshots created when KS is deleted).
This patch fixes this issue by making the database::clear_snapshot() scan the data directories
looking for the snapshots to be deleted instead of relying on in-memory data structures.
This patch makes column_family::clear_snapshot() method not needed any more.
Fixes#2045
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Filter should get all information that the caller has in hand that may be used for filtering.
directory_entry has the following information:
- Type of the entry
- Its name
For the code that used lister filters so far this would be enough, however it's not
hard to imagine a filter that may need the parent directory as well.
We will add the parent directory path in the follow up patches to make the interface
complete.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
If show_hidden parameter is set to show_hidden::yes - list hidden entries, otherwise skip them.
By default set to show_hidden::no.
This patch also completely removes default parameters in lister::scan_dir() and replaces them
with a few lister::scan_dir() overloads that ensure that lambdas are always going to
be the last parameter in the parameters list.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
There is a possibility that the type of the given entry may not be available
that would manifest in the ENOENT or ENOTDIR value set in the errno
by the fstat() call for this entry. In this case engine().file_type() will return
a not engaged optional<directory_entry_type> value.
Return the future with the std::runtime_error exception in this case.
This will prevent any further usage of the not engaged optional value by the code
in the normal flow.
The exception is going to be propagated to the caller and it's the caller's responsibility
to handle it.
Fixes#2071
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* seastar 28a143a...5088065 (8):
> configure.py: switch cmake to build c-ares to do out-of-source-tree build
> iotune: make sure help is working
> collectd: send double correctly for gauge
> tls: make shutdown/close do "clean" handshake shutdown in background
> tls: Make sink/source (i.e. streams) first class channel owners
> native-stack: Make sink/source (i.e. streams) first class channel owners
> posix-stack: Make sink/source (i.e. streams) first class channel owners
> Merge "Detector for tasks blocked" from Glauber
Fixes#2085.
Packaging updated to require cmake, drop libtool and automake.
"This series contains some fixes and a unit test for the logic responsible
for locking counter cells."
* 'pdziepak/cell-locking-fixes/v1' of github.com:cloudius-systems/seastar-dev:
tests: add test for counter cell locker
cell_locking: fix schema upgrades
cell_locker: make locker non-movable
cell_locking: allow to be included by anyone
* cell_entry destructor may be called when the former is unlinked
* update pointer to schema in partition_entry on schema upgrade
* use correct bucket count when creating a new hash table
The test started to fail sporadically on jenkins after
7a00dd6985 due to quiesce() timing out. It's not
clear though if this is a regression because before the series such timeouts
would not cause test failure if the future resulves eventually, timeout was
only logged.
I was not able to reproduce it on my setup nor on jenkins, so let's add more
debugging output and trigger a coredump next time the test fails.
Message-Id: <1487089576-27147-1-git-send-email-tgrabiec@scylladb.com>
ninja-build-1.6.0-2.fc23.src.rpm on fedora web site deleted for some
reason, but there is ninja-build-1.7.2-2 on EPEL, so we don't need to
backport from Fedora anymore.
Fixes#2087
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>
"Immediate reason to do this is to ensure that forwarding of streamed_mutation
will give the same mutations as slicing would, and have unit tests which
verify that those two access methods are consistent with each other.
Secondary reason is performance, to avoid processing unnecessary data.
Note that this should not cause digest mismatch of data queries during rolling
upgrade, because data queries are checksumming only tombstones affecting rows
in the results, so only relevant tombstones.
Fixes #1254."
* tag 'tgrabiec/only-relevant-range-tombstones-v2' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test that slicing returns only relevant range tombstones
tests: Pass all mutation source parameters
tests: mutation_source_tests: Ensure timestamps are strictly monotonic
tests: streamed_mutation_assertions: Add more expectation methods
tests: streamed_mutation_assertions: Make produces_end_of_stream() give better error messages
sstables: Simplify sstable_streamed_mutation::read_next()
sstables: Emit only relevant range tombstones
range_tombstone: Introduce end_position()
position_in_partition: Print position when printing fragment
position_in_partition: Make printable
position_in_partition: Add cast to view
position_in_partition: Generalize from-bound_view constructor
bound_view: Extract converters for range start and end bounds
mutation_partition: Drop unneeded range tombstones
mutation_partition: Simplify row removal
range_tombstone_list: Introduce erase()
partition_snapshot_reader: Emit only relevant tombstones
range_tombstone_stream: Add slicing apply() overload
range_tombstone_list: Introduce slice()
write_{live, counter, expiring, dead}_cell() take a const reference to
an atomic cell as argument. However, their caller (which is
write_row_cells) passes to them an atomic_cell_view.
There is an appropriate implicit constructor so instead of compiler
complaints we get atomic_cell objects being constructed from views which
involves an allocation and a copy.
Message-Id: <20170213100106.9071-1-pdziepak@scylladb.com>
* seastar 83a41c8...28a143a (5):
> prometheus: send one MetricFamily per unique metric name
> tests: Add test for circular_buffer::erase()
> circular_buffer: Introduce erase()
> protect against infinite do_until loop
> metrics: alternative metrics creation with labels
There are several problems with storage_proxy::send_to_endpoint right
now. It uses create_write_response_handler() overload that is specific
to read repair which is suboptimal and creates incorrect logs, it does
not process errors and it does not hold storage_proxy object until write
is complete. The patch fixes all of the problems.
Message-Id: <20170208101949.GA19474@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
"This series changes buffering of mutation fragments in streamed mutations
so that the size of the fragments is taken into account.
The original implementation buffered up to 16 fragments which was pretty
much meaningless since it could be far too much if the fargments were
large or not nearly enough in case they were small
Fixes #2036.."
* 'pdziepak/buffer-mfs-by-size/v1' of github.com:cloudius-systems/seastar-dev:
streamed_mutation: size-based mutation_fragment buffer limit
mutation_fragment: cache size in memory
mutation_fragment: make write access more explicit
Currently, streamed mutations buffer up to 16 mutation fragments. This
may be too much, not enough or a perfect choice depending on the
mutation fragment size.
This patch makes streamed mutation choose how much mutation fragments to
keep in the buffer depending on their size, so that we avoid using too
much memory in case of large mutation fragments and are able to buffer a
lot of fragments if they are small.
mutation_fragments are going to be caching their size in memory. In
order to be able to invalidate that correctly, they need to know when
that size may change (but avoid invalidation when it is not necessary).
"This series uses the newly added histogram and label support to add metrics to
the storage_proxy and to the column_family.
This would add latency and histogram and the missing metrics from column family."
* 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev:
database: add metrics registration for the coloumn family
storage_proxy: add read and write latency histogram
estimated_histogram: returns a metrics histogram
"This series makes sure that schemas containing both counter and non-counter
regular or static columns are not allowed."
* 'pdziepak/disallow-mixed-schemas/v1' of github.com:cloudius-systems/seastar-dev:
schema: verify that there are no both counter and non-counter columns
test/mutation_source: specify whether to generate counter mutations
tests/canonical_mutation: don't try to upgrade incompatible schemas
Tests using random mutation generator should be provided with bot
counter and non-counter mutations to ensure that both cases are
sufficiently covered. However, mixed schemas (with both counter and
non-counter columns) are not allowed so the RMG has to be explicitly
told whether to use counter or non-counter schema.
Test case test_reading_with_different_schemas uses randomly generated
pairs of mutations and tries to upgrade one to the schema of the other.
However, there are cases when one schema cannot be upgraded to another,
for example, counter and non-counter schemas.
This patch adds a metrics registration to the column_family.
Using label each column metrics is label with its keyspace and column
family name.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The metrics histogram is a struct that describe a histogram.
This patch adds a getter method that lets the estimated_histogram return
a metrics::histogram, this will allow to register it as a histogram
metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"This patchset adds bits of the MV write-path, enough to support
new entries to be added. Note that this is still limited, as only
adding new rows to a base table will work correctly."
* 'materialized-views/insert-path/v4' of https://github.com/duarten/scylla: (30 commits)
database: Apply mutation to views
column_family: Push view replica update
materialized views: partial mutate_MV
materialized views: function to send a mutation to endpoint
materialized views: add VIEW write type
database: Ensure new write_type is correctly printed
materialized views: match base and view replicas
column_family: Generate view updates
column_family: Adds affected_views() function
view: Add view_update_builder class
range_tombstone_accumulator: Expose current tombstone
range_tombstone_accumulator: apply() takes value
view_updates: Generate updates
view_updates: Adds function to replace row
view_updates: Update view entry
view_updates: Delete old view entry
mutation_partition: Introduce shadowable tombstone
view_updates: Create view entry
view_updates: Compute row marker
view: Introduce view_updates class
...
This patch changes the database apply path so that it also
generates the mutations for the column family's views and
sends them to the paired view replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This adds a function mutate_MV() which takes view mutations and sends
them to the appropriate nodes (this may be the current node, or a
remote node).
This is only a partial implementation - we still don't do the local
batch log (to survive reboots and failures) and some other stuff which
is left commented out.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Add a function for sending one mutation to one remote replica owning
this mutation. This is needed for materialized views, where each
base replica sends each view mutation to one particular view replica.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This adds to the "write_type" enum also the "VIEW" write type.
To be honest, I don't understand why the "write_type" distinction
is important.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
By removing the default case in the switch statement over a write_type
variable, we ensure the compiler warns us about lack of exhaustiveness
in case we add a value to the enum but forget to change the
corresponding operator<<().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A function to find the appropriate replica to send a view update to.
This patch creates a new source file db/view/view.cc. We should
eventually move a lot more of the materialized views code there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds the generate_view_updates() function to the
column_family class, which will use the view_update_builder to
generate updates to the column_family's materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the view_update_builder class, which is responsible
for calculating the mutations to apply to a column family's
materialized views, given a streamed_mutation representing an update
to the base table and a streamed_mutation representing the
pre-existing rows which the update covers.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
range_tombstone_accumulator::apply() now takes a value so the caller
can decide whether to move or copy the argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the view_updates::generate_update() function to
generate view updates given a base row update and the corresponding,
pre-existing row. This function will decide which of the previously
introduced functions to call based on whether there is a pre-existing
row and whether there exists a regular base column that's part of the
view's PK.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a function to replace a view row given a base
table update and the pre-existing row, which simply deletes the old
view entry and adds a new one.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_updates::update_entry function,
which creates the updates to apply to the existing view entry given
the base table row before and after the update.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_updates::delete_old_entry function,
which creates a view row mutation to delete an entry given an updated
base table row.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces shadowable row tombstones. A shadowable row
tombstone is valid only if the row has no live marker. In other words,
the row tombstone is only valid as long as no newer insert is done
(thus setting a live row marker; note that if the row timestamp set
is lower than the tombstone's, then the tombstone remains in effect
as usual).
If a row has a shadowable tombstone with timestamp Ti and that row
is updated with a timestamp Tj, such that Tj > Ti (and that update
sets the row marker), then the shadowable tombstone is shadowed by
that update. A concrete consequence is that if the update has cells
with timestamp lower than Ti, then those cells are preserved (since
the deletion is removed), and this is contrary to a regular,
non-shadowable row tombstone where the tombstone is preserved and
such cells are removed.
Currently, only Materialized Views require shadowable row tombstones,
which solve a problem with view row deletions. Consider a base row with
columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting
of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations:
1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0;
2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0;
3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0;
Without shadowable tombstones, the view contains:
At 1), pk = (0, 0), row_marker@T0, v2=1@T0
At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0
pk = (0, 1), row_marker@T1, v2=1@T0
At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0
pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0
Notice how, if we read row (0, 0), the value of v2 will be shadowed by
the row tombstone we previously inserted.
With a view's row tombstone becoming shadowable, at 3) the row (0, 0)
will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0,
which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0.
Since the shadowable tombstone is shadowed by the new row marker (T0 <
T2), now v2 would be taken into account.
Finally, note that this patch doesn't generalize the idea of
shadowable tombstone, instead taking advantage of the fact that they
are only needed by Materialized Views. This saves changing the
tombstone representation to account for an extra flag, the bits such
representation would require, and also avoids changes to the storage
format.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_updates::create_entry function, which
creates a view row mutation given a new base table row.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a function to compute the row marker of a view row
given the base row. There are two cases to consider when building the
row marker: 1) there is a column C that is a regular base column but
is in the view PK; and 2) the columns for the base and the view PKs
are the same.
For 1), the view row marker timestamp will be the biggest between the
base's row marker and C. The TTL will be that of C. This means that if
C expires, the view row maker will expire as well (and the row, if no
other column is keeping it alive). Note that if the base row marker
expires but not C, then the base row will still be live due to C and
we shouldn't expire the view row.
For 2), the view row timestamp will be the same as the base row
timestamp. The TTL should be set in such a way that both base and view
rows live for the same time. We thus set the view row TTL to be the
max of any other TTL in the base row. This is particularly important
in the case where the base row marker has a TTL, but a column *absent*
from the view holds a greater one.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_updates class, which is responsible
for generating and storing updates to a particular materialized view.
The updates will be generated from an updated base row and the
pre-existing one (if any), in later patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the collection_type_impl::for_each_cell()
function, which allows the caller to iterate over the cells of a
particular collection_mutation_view.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
To help calculate the view mutations from a base update, we store in
the view class the column that's part of the view's primary key but
not part of the base's, if such column exists.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the matches_view_filter() function which specifies
whether a given base row matches the view filter. Unlike
may_be_affected_by(), this function has no false positives.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the is_satisfied_by() function to
single_column_restriction, which given a clustering row returns
whether the restrictions applies or not.
This is useful for secondary indexing such as materialized views,
where filters on regular columns precisely select which base table
rows to denormalize.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch exposes the non-primary key column restrictions in a given
select statement, exposing them as single_column_restrictions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the is_set() and is_list() functions to
collection_type_impl, which identify the concrete collection
type.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch renames abstract_restriction::uses_function() to
term_uses_function(), as it was previously hiding a function with the
same name in the restriction base class.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the may_be_affected_by() function to the view class,
which is responsible to determine whether an update to a base class
affects one of its views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of storing the view in the column_family's map of materialized
views, store a lw_shared_ptr so that the view can be removed while it
is being updated.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch fixes a use-after-free error in
rename_column_in_where_clause(), where we were creating a boost
adaptor on an rvalue.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Fixes #1531
Adds lookup to gms::inet_address and uses it in (hopefully all) the
salient places where configured symbolic names are interpreted.
Removes the dummy dns modula in scylla in favour of the seastar one."
* 'calle/use-dns' of github.com:cloudius-systems/seastar-dev:
remove scylla dns code
service::storage_service: Remove depedency on scylla dns
main.cc: remove scylla dns dependency
main/init: Lookup inet addresses from config by dns lookup
db::system_keyspace: Find rpc_address by lookup
gms::inet_address: Add lookup functionality.
scylla tls: Add option support for client auth and tls opts
This helps achieve more repeatable runs that can then be compared via the
Linux perf tool. The option overrides duration-based testing and runs the
test for a specific number of iterations.
Message-Id: <20170204172937.8462-1-avi@scylladb.com>
Merge commit 45b6070832 used butchered version of storage_proxy
patch to adjust to rpc timer change instead the one I've sent. This
patch fixes the differences.
Message-Id: <20170206095237.GA7691@scylladb.com>
Refs #1813 (fixes scylla part)
Added require_client_auth and priority_string options to
server_encryption_options/client_encryption_options an process them.
Allows TLS method/algo specification. Also enabled enforcing known cert
authentication for both node-to-node and client communication.
scylla-housekeeping requires to run 'restart mode' for check the version during
scylla-server restart, which wasn't called on systemd timer so added it.
Existing scylla-housekeeping.timer renamed to scylla-housekeeping-daily.timer,
since it is running 'daily mode'.
Fixes#1953
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1486180031-18093-1-git-send-email-syuu@scylladb.com>
The comparator constructor took schema by value instead of const l-ref
and, consequently, later tried to access object that has been destroyed
long time ago.
Message-Id: <20170202135853.8190-1-pdziepak@scylladb.com>
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
Before, the logic for releasing writes blocked on dirty worked like
this:
1) When region group size changes and it is not under pressure and
there are some requests blocked, then schedule request releasing
task
2) request releasing task, if no pressure, runs one request and if
there are still blocked requests, schedules next request
releasing task
If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The amount of scheduled tasks is at most 1, there is a
single thread of excution.
However, if requests themselves would change the size of the group,
then each such change would schedule yet another request releasing
thread, growing the task queue size by one.
The group size can also change when memory is reclaimed from the
groups (e.g. when contains sparse segments). Compaction may start
many request releasing threads due to group size updates.
Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency becuase timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in
the system. When the amount of scheduled tasks reaches 1000, polling
stops and server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or run out of blocked requests. It may take a while to
reach pressure condition after memtable flush if it brings virtual
dirty much below the threshold, which is currently the case for
workloads with overwrites producing sparse regions.
Refs #2021.
Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.
The logic for notification across hierachy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
The hard pressure was only signalled on region group when
run_when_memory_available() was called after the pressure condition
was met.
So the following loop is always an infinite loop rather than stopping
when engouh is allocated to cause pressure:
while (!gr.under_pressure()) {
region.allocate(...);
}
It's cleaner if pressure notification works not only if
run_when_memory_available() is used but whenever conditino changes,
like we do for the soft pressure.
There is comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:
- we already notify on soft pressure conditions from update(), and if
that is safe, notifying about hard pressure should also be safe. I
checked and it looks safe to me.
- avoiding notification in the rare case when we stopped writing
right after crossing the threshold doesn't seem benefitial. It's
unlikely in the first place, and one could argue it's better to
actually flush now so that when writes resume they will not block.
We already call these when crossing the soft threshold. We shouldn't
stop reclaiming when hard pressure is gone because soft pressure may
still be present. Calling start_reclaiming() on hard pressure is
unnecessary because soft pressure also starts it, and when there is
hard pressure there is also soft pressure.
This document is intended to help developers and contributors to Scylla get started. The first part consists of general guidelines that make no assumptions about a development environment or tooling. The second part describes a particular environment and work-flow for exemplary purposes.
## Overview
This section covers some high-level information about the Scylla source code and work-flow.
### Getting the source code
Scylla uses [Git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules) to manage its dependency on Seastar and other tools. Be sure that all submodules are correctly initialized when cloning the project:
```bash
$ git clone https://github.com/scylladb/scylla
$ cd scylla
$ git submodule update --init --recursive
```
### Dependencies
Scylla depends on the system package manager for its development dependencies.
Running `./install_dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native thread, and up to 3 GB per native thread while linking.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.
To build for the first time:
```bash
$ ./configure.py
$ ninja-build
```
Afterwards, it is sufficient to just execute Ninja.
The full suite of options for project configuration is available via
```bash
$ ./configure.py --help
```
The most important options are:
-`--mode={release,debug,all}`: Debug mode enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) and allows for debugging with tools like GDB. Debugging builds are generally slower and generate much larger object files than release builds.
-`--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
Source files and build targets are tracked manually in `configure.py`, so the script needs to be updated when new files or targets are added or removed.
To save time -- for instance, to avoid compiling all unit tests -- you can also specify specific targets to Ninja. For example,
Unit tests live in the `/tests` directory. Like with application source files, test sources and executables are specified manually in `configure.py` and need to be updated when changes are made.
A test target can be any executable. A non-zero return code indicates test failure.
Most tests in the Scylla repository are built using the [Boost.Test](http://www.boost.org/doc/libs/1_64_0/libs/test/doc/html/index.html) library. Utilities for writing tests with Seastar futures are also included.
Run all tests through the test execution wrapper with
```bash
$ ./test.py --mode={debug,release}
```
The `--name` argument can be specified to run a particular test.
Alternatively, you can execute the test executable directly. For example,
```bash
$ build/release/tests/row_cache_test -- -c1 -m1G
```
The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread and 1 GB of memory.
### Preparing patches
All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/). There are also some guidelines that can help you make the patch review process smoother:
1. Before generating patches, make sure your Git configuration points to `.gitorderfile`. You can do it by running
```bash
$ git config diff.orderfile .gitorderfile
```
2. If you are sending more than a single patch, push your changes into a new branch of your fork of Scylla on GitHub and add a URL pointing to this branch to your cover letter.
3. If you are sending a new revision of an earlier patchset, add a brief summary of changes in this version, for example:
```
In v3:
- declared move constructor and move assignment operator as noexcept
- used std::variant instead of a union
...
```
4. Add information about the tests run with this fix. It can look like
```
"Tests: unit ({mode}), dtest ({smp})"
```
The usual is "Tests: unit (release)", although running debug tests is encouraged.
5. When answering review comments, prefer inline quotes as they make it easier to track the conversation across multiple e-mails.
### Finding a person to review and merge your patches
You can use the `scripts/find-maintainer` script to find a subsystem maintainer and/or reviewer for your patches. The script accepts a filename in the git source tree as an argument and outputs a list of subsystems the file belongs to and their respective maintainers and reviewers. For example, if you changed the `cql3/statements/create_view_statement.hh` file, run the script as follows:
Tomasz Grabiec <tgrabiec@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
MATERIALIZED VIEWS
Pekka Enberg <penberg@scylladb.com> [maintainer]
Duarte Nunes <duarte@scylladb.com> [maintainer]
Nadav Har'El <nyh@scylladb.com> [reviewer]
Duarte Nunes <duarte@scylladb.com> [reviewer]
```
### Running Scylla
Once Scylla has been compiled, executing the (`debug` or `release`) target will start a running instance in the foreground:
```bash
$ build/release/scylla
```
The `scylla` executable requires a configuration file, `scylla.yaml`. By default, this is read from `$SCYLLA_HOME/conf/scylla.yaml`. A good starting point for development is located in the repository at `/conf/scylla.yaml`.
For development, a directory at `$HOME/scylla` can be used for all Scylla-related files:
The `scylla.yaml` file in the repository by default writes all database data to `/var/lib/scylla`, which likely requires root access. Change the `data_file_directories` and `commitlog_directory` fields as appropriate.
Scylla has a number of requirements for the file-system and operating system to operate ideally and at peak performance. However, during development, these requirements can be relaxed with the `--developer-mode` flag.
Additionally, when running on under-powered platforms like portable laptops, the `--overprovisined` flag is useful.
Multiple release branches are maintained on the Git repository at https://github.com/scylladb/scylla. Release 1.5, for instance, is tracked on the `branch-1.5` branch.
Similarly, tags are used to pin-point precise release versions, including hot-fix versions like 1.5.4. These are named `scylla-1.5.4`, for example.
Most development happens on the `master` branch. Release branches are cut from `master` based on time and/or features. When a patch against `master` fixes a serious issue like a node crash or data loss, it is backported to a particular release branch with `git cherry-pick` by the project maintainers.
## Example: development on Fedora 25
This section describes one possible work-flow for developing Scylla on a Fedora 25 system. It is presented as an example to help you to develop a work-flow and tools that you are comfortable with.
### Preface
This guide will be written from the perspective of a fictitious developer, Taylor Smith.
### Git work-flow
Having two Git remotes is useful:
- A public clone of Seastar (`"public"`)
- A private clone of Seastar (`"private"`) for in-progress work or work that is not yet ready to share
The first step to contributing a change to Scylla is to create a local branch dedicated to it. For example, a feature that fixes a bug in the CQL statement for creating tables could be called `ts/cql_create_table_error/v1`. The branch name is prefaced by the developer's initials and has a suffix indicating that this is the first version. The version suffix is useful when branches are shared publicly and changes are requested on the mailing list. Having a branch for each version of the patch (or patch set) shared publicly makes it easier to reference and compare the history of a change.
Setting the upstream branch of your development branch to `master` is a useful way to track your changes. You can do this with
As a patch set is developed, you can periodically push the branch to the private remote to back-up work.
Once the patch set is ready to be reviewed, push the branch to the public remote and prepare an email to the `scylladb-dev` mailing list. Including a link to the branch on your public remote allows for reviewers to quickly test and explore your changes.
### Development environment and source code navigation
Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt`, for use only with development environments (not for building) so that they can properly analyze the source code.
[CLion](https://www.jetbrains.com/clion/) is a commercial IDE offers reasonably good source code navigation and advice for code hygiene, though its C++ parser sometimes makes errors and flags false issues.
Other good options that directly parse CMake files are [KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).
To use the `CMakeLists.txt` file with these programs, define the `FOR_IDE` CMake variable or shell environmental variable.
[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects, and its C++ parser has many similar issues as CLion.
### Distributed compilation: `distcc` and `ccache`
Scylla's compilations times can be long. Two tools help somewhat:
- [ccache](https://ccache.samba.org/) caches compiled object files on disk and re-uses them when possible
- [distcc](https://github.com/distcc/distcc) distributes compilation jobs to remote machines
A reasonably-powered laptop acts as the coordinator for compilation. A second, more powerful, machine acts as a passive compilation server.
Having a direct wired connection between the machines ensures that object files can be transmitted quickly and limits the overhead of remote compilation.
The coordinator has been assigned the static IP address `10.0.0.1` and the passive compilation machine has been assigned `10.0.0.2`.
On Fedora, installing the `ccache` package places symbolic links for `gcc` and `g++` in the `PATH`. This allows normal compilation to transparently invoke `ccache` for compilation and cache object files on the local file-system.
Next, set `CCACHE_PREFIX` so that `ccache` is responsible for invoking `distcc` as necessary:
```bash
exportCCACHE_PREFIX="distcc"
```
On each host, edit `/etc/sysconfig/distccd` to include the allowed coordinators and the total number of jobs that the machine should accept.
This example is for the laptop, which has 2 physical cores (4 logical cores with hyper-threading):
`10.0.0.2` has 8 physical cores (16 logical cores) and 64 GB of memory.
As a rule-of-thumb, the number of jobs that a machine should be specified to support should be equal to the number of its native threads.
Restart the `distccd` service on all machines.
On the coordinator machine, edit `$HOME/.distcc/hosts` with the available hosts for compilation. Order of the hosts indicates preference.
```
10.0.0.2/16 localhost/2
```
In this example, `10.0.0.2` will be sent up to 16 jobs and the local machine will be sent up to 2. Allowing for two extra threads on the host machine for coordination, we run compilation with `16 + 2 + 2 = 20` jobs in total: `ninja-build -j20`.
When a compilation is in progress, the status of jobs on all remote machines can be visualized in the terminal with `distccmon-text` or graphically as a GTK application with `distccmon-gnome`.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next section speeding up this process.
### Using the `gold` linker
Linking Scylla can be slow. The gold linker can replace GNU ld and often speeds the linking process. On Fedora, you can switch the system linker using
```bash
$ sudo alternatives --config ld
```
### Testing changes in Seastar with Scylla
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.