leveled strategy uses heavily first and last decorated keys of a
sstable to get overlapping sstables in a given level. By storing
first and last decorated keys in sstable object, it's expected
that performance of leveled strategy (not compaction) will be
improved.
We will set first and last keys in sstable when either loading
or sealing it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0abca819454ab4c088541bb49714f1f6a7dc4f42.1473959677.git.raphaelsc@scylladb.com>
* seastar 0303e0c...e534401 (6):
> Merge "enable rpc to work on non contiguous memory for receive" from Gleb
> install-dependencies.sh: install python3 for Ubuntu/Debian, which requires for configure.py
> fix tcp stuck when output_stream write more than 212992 bytes once.
> scripts/posix_net_conf.sh: supress 'ls: cannot access /sys/class/net/<NIC>/device/msi_irqs/' error message
> scripts/posix_net_conf.sh: fix 'command not found' error when specifies --cpu-mask
> native_network_stack: Fix use after free/missing wait in dhcp
Includes: "Remove utils::fragmented_input_stream and utils::input_stream in favor of seastar version" from Gleb.
That will be needed for optimization that will store decorated keys
in the sstable object, and also for a subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As schema is stored in sstable, there's no longer
a need to store ks and cf names in it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This reverts commit 1726b1d0cc.
Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.
Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.
The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format in the other side.
Conflicts:
row_cache.hh
Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scy
lladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch makes the optional trace_state_ptr arguments introduced in
previous patches mandatory where possible. Functions which are called
internally don't have a trace context, so for those we keep the
argument's default value for convenience.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If we have a cache hit, we still need to update our sstable histogram - notting
that we have touched 0 SSTables.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"clustering_key_filtering_context is no longer needed.
partition_slice can be used instead so this series removes
clustering_key_filtering_context and passes partition_slice down where
it's needed. Then a static get_ranges method is used to obtain
clustering key ranges for a given partition.
Fixes #1614."
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Reversed iterators are adaptors for 'normal' iterators. These underlying
iterators point to different objects that the reversed iterators
themselves.
The consequence of this is that removing an element pointed to by a
reversed iterator may invalidate reversed iterator which point to a
completely different object.
This is what happens in trim_rows for reversed queries. Erasing a row
can invalidate end iterator and the loop would fail to stop.
The solution is to introduce
reversal_traits::erase_dispose_and_update_end() funcion which erases and
disposes object pointed to by a given iterator but takes also a
reference to and end iterator and updates it if necessary to make sure
that it stays valid.
Fixes#1609.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>
This patch makes append() and write() limit the maximum size of a single
allocation to bytes_ostream::max_chunk_size.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Once unlink_leftmost_without_rebalance() has been called on a bi::set no
other method can be used. This includes clear_and_disposed() used by the
mutation_partition destructor.
We like unlink_leftmost_without_rebalance() because it is efficient, so
the solution is to manually finish destroying clustering row and range
tombstone sets in the reader destructor using that function.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch introduces the nonwrapping_range class. This class is
intended to be used by code that requires non wrapping ranges.
Internally, it uses a wrapping_range. Users are responsible for
ensuring the bounds are correct when creating a nonwrapping_range.
The path proposed here is to incrementally replace usages of
wrapping_range/range by nonwrapping_range, pushing usages of wrapping
ranges as further to the edges as possible.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.
Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.
Flush queue can now (optionally) propagate exceptions to all clients, and
commit log uses this to ensure that commit log writers in batch mode
all generate exceptions on disk errors.
Also includes some rudimentary tests for flush queue mechanisms.
Note: other main user, sstable flushing, is not affected, as default
mode is still to keep exceptions to individual worker continuations,
not waiters."
In this unit test, we create using Scylla C++ code, the same large
partition with 13520 CQL rows as we previously imported from Cassandra
for the large partition test. We then verify that the sstable index file
we just wrote is byte-for-byte identical to the one previously created by
Cassandra. They should indeed be identical, because the data file has the
same layout (even if timestamps are different) and our default promoted-
index block size is the same (64K) so the sample of columns should be
identical.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, the main sstable data parsing entry point data_consume_rows()
takes a contiguous range of bytes to read from disk and parse. This range
is supposed to be an entire partition or contiguous group of partitions.
and is self contained (can be parsed without extra information about the
identity of these partitions).
For the promoted index feature (which we will add in a following patch)
we will want the range to span only a part of a partition, and will need
the caller to provide some information not available to the parser (such
as the partition's key). In the future, we will also want to support a
vector of byte ranges, instead of just one.
So in preparation for this, this patch simply replaces the start/end pair
by a new class disk_read_range, which can be easily extended in later
patches. No new functionality is introduced in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test that takes an sstable with one partition of 13,520
clustering rows (spanning 700 KB in the data file), and attempts to read
various slices CQL rows, counting that we got back the expected number
of rows.
The sstable included here was generated by Cassandra, and includes a
promoted index. Promoted index reading is not supported yet (we will
add it in the next patch), so for now the code will always read the
entire partition from disk; But still the clustering-key filtering is
already functional, and will drop some of the rows as requested,
so this test will pass.
Later, when we add promoted index support, we should check that this test
still passes - promoted index will make the reads in this test more
efficient (which the test cannot verify), but the important thing to check
is that it doesn't break any of these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Once we encounter a wide partition store information
about this in cache entry and don't try to read it all
and cache next time it's requested.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[Paweł: rebased, moved large partition reading logic to
cache_entry::read_wide()]
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In case of failure BOOST_REQUIRE_EQUAL() is nicer and prints the actual
values that were supposed to be equal, but aren't.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable has member functions that create objects which need to extend
lifetime of the sstable (for example mutation_readers), the easiest way
to achieve that is to enable_lw_shared_from_this for sstable.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The test bit rot. One thing is that cache is no longer empty right
after creation, since we added ghost entries to it. Second problem is
that mutation serializer needs storage service to be initialized, so
we need to setup a full cql test env.
Message-Id: <1469626546-4279-1-git-send-email-tgrabiec@scylladb.com>
Cassandra 1.x clusters often use RandomPartitioner. Supporting
RandomPartitioner will allow easier migration to Scylla
Tests are added to make sure scylla generates the same token as
Cassandra does for the same partition key.
Fixes#1438
Message-Id: <3bc8b7f06fad16d59aaaa96e2827198ce74214c6.1469166766.git.asias@scylladb.com>
"When reading a partition try to read it all
but once more bytes are read than a given limit
we decide that partition is wide and we don't cache it.
Instead we retry the read with clustering key filtering
applied."
That was a bug in the test itself. It could happen that a sstable would
incorrectly belong to the next time window if the current minute is
approaching its end. Fix is about having all sstables that we want in
the same time window with the same min/max timestamp.
Fixes#1448.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ee25d49e7ed12b4cf7d018a08163404c3d122e56.1468782787.git.raphaelsc@scylladb.com>
This patch adds the deoverlap function to range.hh, which takes in a
vector of possibly overlapping ranges and returns a vector of
non-overlapping ranges covering the same values.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
From Paweł:
This is another episode in the "convert X to streamed mutations" series.
Hashing mutations (mainly for repair) is converted so that it doesn't
need to rebuild whole mutation.
The first part of the series changes the way streamed mutations deal
with range tombstones. Since it is not necessary to make sure we write
disjoint tombstones to sstables there is no need anymore for streamed
mutations to produce disjoint tombstones and, consequently, no need for
range tombstones to be split into range_tombstone_begin and
range_tombstone_end.
The second part is the actual hashing implementation. However, to ensure
that the hash depends only on the contents of the mutation and no the
way it is stored in different data sources range tombstones have to be
made disjoint before they are hashed.
This series also ensures that any changes caused by streamed mutations
to hashing and streaming do not break repair during upgrade.