"This series cleans the streaming_histogram and the estimated histogram that
were importad from origin, it then uses it to get the estimated min and max row
estimation in the API."
mutation_result_merger can outlive query::read_command, so it have to
hold shared pointer to it instead of reference. The bug was introduced by
89e36541c3
Currently limit is enforced only on partition boundary, so real result
can contain 2*row_limit - 1 rows in the worst case. Fix it by trimming
rows from a mutation if only part of its rows fit the requested limit.
This adds the latency histogram support to the storage_proxy.
It uses a the latency object to mark the opetation latency, if there
will be an impact on performance, it can be changed from all operations
to sample of the operation.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
If the requested range wraps around the end of the ring, make_local_reader()
ends up trying to create an sstable reader with that wrapping range, which
is not supported. Also our shard-looping code is wrong in a wrapping range.
The solution is simple: split the wrap-around range into two (from the start
of the range until the end of the ring, and then from the beginning of the
ring until the end of the range), and read each of these subranges normally.
This feature is needed to allow streaming a range that wraps around, because
streaming currently uses make_local_reader().
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
make_local_reader takes a partition-range parameter, passed by *reference*.
The meaning of this passing-by-reference is not documented: when
make_local_reader returns, which of the following two statements holds?
1. make_local_reader still holds this reference, so the caller is
forced to keep the range object alive. For how long?
or
2. make_local_reader reads the range before returning, and when it
returns, the caller continues to "own" the range object, and is
allowed to do anything, including delete it.
The principle of least surprise suggests that #2 is better, unless
there is a very good reason to choose #1, and unless this requirement to
keep the argument alive (and for how long) was clearly documented.
But neither is the case here - the overhead of copying the argument is
negligable compared to the other overheads of make_local_reader, and
nothing was documented.
But it turns out the current code did #1 - the address of the range was
passed around *after* make_local_reader returns - the different readers
on different cpus continue to use it to eventually find the right sstable
byte range to iterate over. It very easy to forget this and call
make_local_reader on a on-stack range variable, and the result is a
hard-to-debug use-after-free mess.
This patch switches us to situation #2: Before make_local_reader returns,
the range is copied to whoever needs to hold it after the return, namely
each individual shard_reader. The patch to do this is trivial (one
character removed).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
There's nothing legacy about it so rename legacy_schema_tables to
schema_tables. The naming comes from a Cassandra 3.x development branch
which is not relevant for us in the near future.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Instead of merging shard data using make_combined_reader(), take advantage
of the fact that shard data is disjoint, and use make_joining_reader().
This removes the need to sort the partitions as they are being read.
Current code passes the storage_proxy instance of the shard on which the
message was received instead of using the storage_proxy instance of the
shard that sent the request.
Signed-off-by: Shlomi Livne <shlomi@cloudius-systems.com>
It will be stored in the schema, so move there, where it belongs. We'll need
to do more than just store a type, so provide a class that encapsulates it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Now that consistency level dependency issues are sorted out, move
request_timeout_exception to exceptions.hh.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Move unavailable_exception to exceptions.hh where other CQL transport
level exceptions are defined in.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Use the shiny new read_timeout_exception that handles CQL transport
protocol encoding properly.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
This patch set the write timmers: histogram, timeout and unavailable.
For the histogram a latency is needed. For that the latency object is
used.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
range::is_wrap_around() and range::contains() rely on total ordering
on values to work properly. Current ring_position_comparator was only
imposing a weak ordering (token positions equal to all key positions
with that token).
range::before() and range::after() can't work for weak ordering. If
the bound is exclusive, we don't know if user-provided token position
is inside or outside.
Also, is_wrap_around() can't properly detect wrap around in all
cases. Consider this case:
(1) ]A; B]
(2) [A; B]
For A = (tok1) and B = (tok1, key1), (1) is a wrap around and (2) is
not. Without total ordering between A and B, range::is_wrap_around() can't
tell that.
I think the simplest soution is to define a total ordering on
ring_position by making token positions positioned either before or
after all keys with that token.
Transport layer expects to get error code in an exception of type
exceptions::cassandra_exception. Fix code to use it as a base for
all user visible exceptions and put correct error code there.
Make storage proxy a singleton and add helpers to look up a reference.
This makes it easier to convert code from Origin in areas such as
storage service.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
If we had a range (x; ...] then x is excluded, but token iterator was
initialized with x. The splitting loop would exit prematurely because
it would detect that the token is outside the range.
The fix is to teach ring_range() to recognize this and always give
tokens which are not smaller than the range's lower bound.
Origin has no notion of a maximum token so a range without upper bound
is represented as (x; min]. The splitting code is supposed to produce
only non-wrapping ranges, but (x; min] looks like a wrapping range, so
database code which consumes it would have to special-case for it. A
simpler solution is to change the splitting code to never produce a
wrapping range.
The wrappers also take care of cases when each bound is undefined, and
return mimimum or maximum token respectively, which fixes undefined
behavior in get_restricted_ranges().
Alternative solution would be to make partition_range use a different
type of range, the one where bounds are always specified. However it's
not worth introducing a new range type just for those few users.
The patch introduces reconciliation code. The same code suppose to be
working for both range and single key queries. Handling of raw_limit,
short reads and read repairs is still very much missing.
--
v1->v2:
- call live_row_count() only once.