Current speculation target selection logic has several bugs in multi-dc
setup. It may select a non local target for CL=LOCAL and it may select
more than one target to speculate, one of which is non local.
Examples:
1. Two dataceneters: DC1 RF 2, DC2 RF 2 and read with LOCAL_QUORUM.
In this scenario db::filter_for_query() will return both replicas from
local DC and speculation target selection logic will peek one one which
will be in different DC.
2. Two dataceneters: DC1 RF 2, DC2 RF 2 and read with LOCAL_ONE + RRD.DC_LOCAL
In this scenario db::filter_for_query() will return all nodes in local DC and
there already be enough nodes to speculate, but current logic will add
one node from non local dc as a speculation target.
The patch below fixed both of those scenarios.
Message-Id: <20161103154637.GS7766@scylladb.com>
Since continuity flag introduction row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
partition_range passed to row_cache::make_reader
has to be kept alive as long as the resulting reader
is used.
Otherwise weird things start to happen.
This used to work just because of a pure luck.
When I started changing the row_cache implementation
I run into very weird behaviors for this tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>
Commit 8fca1887c2 ("storage_service: fix range wrapping in
describe_ring") fixed incorrect range wrapping code for describe_ring,
but fails when the number of endpoints for a token is greater than one,
because the endpoints are stored in an unordered vector.
Fix by comparing the endpoints in a way that ignores their order.
Message-Id: <1478460826-15923-1-git-send-email-avi@scylladb.com>
Fixes#1775
stream lacks a check "is_open", which is a bummer. We have to both
prevent exception propagation and add a flag of our own to make sure
exceptions in producer code reaches consumer, and does not simply
get lost in the reactor.
Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>
This patchset adds missing properties to the create_view_statement,
such as whether the view is compact or the order of its clustering
columns.
Fixes#1766
"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion. This
causes high cpu utilization during startup.
Change the code to avoid quadratic complexity, and add some unit tests.
See #1812."
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy. This guarantees
that any exception does not change the data, since the assignment does
not require allocation.
However, it does result in quadratic behavior. During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
are do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.
Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.
Fixes#1812 (the quadratic beahvior part).
"Currently, partition range queries are processed in parallel on all
shards. This is inefficient because we are likely to drop the results
from all but one shard, assuming a well-populated column family. We
are multiplying our work by a factor of smp::count.
While this is worthwhile in its own right, it is really an excuse to
sneak in the range/shard generator (patch 5), which is preliminary for
a new sharding algorithm, dividing tokens among shards based on the
middle-significant bits rather than the most-siginificant bits (which
alias with vnodes)
Fixes #1573."
Instead of asking a shard for cmd->partition_limit and cmd->row_limit,
just ask it for the number of partitions and rows still needed to
satisfy the query. This removes the need to trim the shard's result.
Since every shard might cause the row_limit quota to be satisfied, every
shard might be the last one we need. Hence it is better to process shards
sequentially, stopping if the quota is reached or the range is exhausted.
The original code tried to yield to reduce latency, but this is now
unnecessary, as we're doing a lot less work per iteration (if it becomes
necessary, we should do it on the replica shard, not the coordinating shard).
Building on the single-range sharder, add a sharder for vectors of
partition ranges. This helps with wrapped ranges, which are translated
into a vector containing two shards.
Divides a ring_position range into a sequence of shard/range pairs. This
allows sequential iteration over shards in ring order.
The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page. With the generator, we can switch to sequential
execution.
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.
To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.
It is a sort-of inverse to shard_of(), in that
shard_of(token_for_next_range(t)) == shard_of(t) + 1
- Add a inserts, updates, deletes members to cql_stats.
- Store cql_stats& in a modification_statement and increment the corresponding counter according to the value of a "type" field.
- Store cql_stats& in a batch_statement and increment the statistics for each BATCH member.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Add a "reads" counter to a cql3::cql_stats struct.
- Store a reference for a query_processor::_cql_stats in the select_statement object.
- Increment a "reads" counter where needed.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Right now we are calculating latencies only when we are about to add an
item to the memtable.
That's incorrect and misleading, for two reasons. First, it leaves the
commitlog latencies out. But second, it is done after the memtable wall
effect is applied, which means we are not counting throttle time neither
in the memtables or in the commitlog.
To do that, we'll start the latency_counter object as soon as possible
and move it all the way to apply_in_memory(). That should span the
entire write operation.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>
There are two variants of apply_in_memory() being called in do_apply():
with and without the commitlog. The main differences are that when the
commitlog is involved, we need to wait for its future to complete before
moving to apply_in_memory. That can easily be factored out by providing
an always-ready future if we don't have the commitlog enabled, and
waiting on that.
The second, is that the commitlog version can cause apply_in_memory to
generate an exception if there is replay position reordering. However,
there is no harm in appending the exception handler to both versions. In
one of them it's an impossible exception, but that's fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.
If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
Since size estimates are stored as wrapped ranges, we call compat::wrap()
to convert from the now-standard unwrapped ranges back to wrapped ranges.
However, compat::wrap() relies on the ranges being in sorted order,
but our input is not. This leads to a crash as we find an unexpected
empty token in the middle of the vector.
Sort it so compat::wrap() works as expected.
Fixes#1804.
Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>