"Added support for min/max functions over date/timestamp/timeuuid.
There was one issue with Scylla's type system internals: no C++ type
was mapped to these types. So special "native_types" were added for them.
It required some changes to native functions because these types don't support
the same operations as their real native counterparts.
Fixes #3104."
* 'danfiala/3104-v1' of https://github.com/hagrid-the-developer/scylla:
tests: Tests for min/max aggregate functions over date/timestamp and timeuuid.
functions: Added min/max functions for date/timestamp/timeuuid.
types: Added native types for timestamp and timeuuid.
Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported.
Fixes argument misquoting at $SRPM_OPTS expansion for the mock commands
and makes the --jobs argument work as supposed.
Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
Test fails after fa5a26f12d because generated sstable doesn't contain data for the
shard it was created at, so sharding metadata is empty, resulting in exception
added in the aforementioned commit. That's fixed by using the new make_local_key()
to generate data that belongs to current shard.
make_local_keys(), from which make_local_key() is built on top of, will be useful
to make sstable test work again with any smp count.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180114032025.26739-1-raphaelsc@scylladb.com>
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
* git@github.com:glommer/scylla.git timeouts-v8.1:
database: delete unused function
consolidate timeout_clock
mutation_query: add a timeout to the mutation query path
flat_mutation_reader: pass timeout down to consume()
add a timeout to fill_buffer
add a timeout to fast forward to
restricted_mutation_reader: don't pass timeouts through the config
structure
allow request-specific read timeouts in storage proxy reads
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
Implementation notes, and general comments about open discussion in 2462:
* Because we are not bypassing the timeout, just setting it high enough,
I consider the concerns about the batchlog moot: if we fail for any
other reason that will be propagated. Last case, because the timeout
is per-CF, we could do what we do for the dirty memory manager and
move the batchlog alone to use a different timeout setting.
* Storage proxy likes specifying its timeouts as a time_point, whereas
when we get low enough as to deal with the read_concurrency_config,
we are talking about deltas. So at some point we need to convert time_points
to durations. We do that in the database query functions.
v2:
- use per-request instead of per-table timeouts.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-timeout requests.
The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove.
The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the last patch, we enabled per-request timeouts, we enable timeouts
in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.
In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.
The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.
At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We pass the timeout that we received from data_query/mutation_query
down to consume, which is responsible for actually reading the data.
To make those timeouts actionable, though, we'll have to patch
fill_buffer(). This will happen in the next patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
data_query and mutation_query are patched so that they start accepting a
per-query timeout. We will default to no timeout, and then no callers
will be changed yet.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.
In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Scylla's configure.py calls seastar/configure.py and uses seastar.pc
that it produces to generate Scylla's build.ninja. However, there is no
appropriate dependency in build.ninja and changes to
seastar/configure.py alone do not trigger regeneration of Scylla's
build.ninja. This patch remedies that problem.
Message-Id: <20180111144237.5259-1-pdziepak@scylladb.com>
On Debian 9, 'pbuilder create' fails because of lack of GPG key for
3rdparty repo, so we need --allow-untrusted on 'pbuilder create' and
'pbuilder update'.
Also, apt-key adv --fetch-keys does not works correctly on it, but we can use
"curl <URL> | apt-key add -" as workaround.
Fixes#3088
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com>
If the compaction_backlog_manager's lifetime ends before the linked
compaction_backlog_tracker's, the latter's _manager pointer not being
cleared, can lead to a use-after-free error when running
~compaction_backlog_tracker(), as evidenced by unit-tests failed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-2-duarte@scylladb.com>
The legacy mutation_reader/streamed_mutation design allowed very easily
to skip the partition merging logic if there was only one underlying
reader that has emitted it.
That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.
The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".
perf_simple_query -c4 read results (medians of 60):
original regression
before 8731c1 after 8731c1 diff
read 326241.02 300244.09 -8.0%
this patch
before after diff
read 313882.59 325148.05 3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>
"This series revives the round-robin load balancing added by Pekka back in 2015.
If somebody tries to enable it with the current master it would quite quickly
lead to a crash due to a few unresolved issues in the corresponding code.
Fixes#2351Fixes#3118"
* 'fix-round-robin-balancing-v2' of github.com:vladzcloudius/scylla:
transport::server::process_request(): avoid extra copy of the client_state
service::cql_server::connection::process_request: use client_state "request copy" constructor
service::client_state: introduce "request copy" copy-constructor
service::storage_service: add the get_local_auth_service() accessor
service::client_state: remove the unused _tracing_session_id field
Don't use submit_to(...) when we are going to handle the request on a local
shard. Otherwise there is a not needed copy of the _client_state in the submit_to(...)
lambda capture list.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Create a cross-shard copy of the client_state object and give it to the single request handling
function and give it a timestamp generated by the original client_state instance (which is promised
to be monotonous).
Fixes#3118
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
A new constructor creates a copy of the current client_status to be
used in the context of the handling of a single request.
The copy may take place at a shard different from the one where the
request has been received.
In order to ensure the monotonicity of the timestamps used by the request handled
on the same connection the created copy of the client_state is going to use the same timestamp provided by the
caller instead of generating it.
It's the caller's responsibility to ensure the monotonicity of given timestamps.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
sprint() may need to allocate significant amount of memory if mutation
is large, and cause bad_alloc in
row_cache_test::test_concurrent_reads_and_eviction.
Message-Id: <1515486454-4913-1-git-send-email-tgrabiec@scylladb.com>
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when abort the session. Sending the
failed message in this case will send to a peer with uninitialized
dst_cpu_id which will casue the receiver to pass a bogus shard id to
smp::submit_to which cases segfault.
In addition, to be safe, initialize the dst_cpu_id to zero. So that
uninitialized session will send message to shard zero instead of random
bogus shard id.
Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test
Fixes#3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
After 611774b, we're blind again on which sstable caused a compaction
to fail, leaving us with cryptic message as follow:
compaction_manager - compaction failed: std::runtime_error (compressed
chunk failed checksum)
After this change, now both read failure in compaction or regular read
will report the guilty sstable, see:
compaction_manager - compaction failed: std::runtime_error (SSTable reader
found an exception when reading sstable ./data/.../keyspace1-standard1
ka-1-Data.db : std::runtime_error(compressed chunk failed checksum))
Fixes#3006.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>
"This patchset implements the compaction controller for I/O shares. The
goal is to automatic adjust compaction shares based on a
strategy-specific backlog. A higher backlog will translate into higher
shares.
As compaction progresses, that reduces the backlog. As new data is
flushed, that increases the backlog. The goal of the controler is to
keep the backlog constant at a certain rate, so that we don't go neither
too fast or too slow.
Tracking reads and writes:
==========================
Tracking of reads and writes happen through the read_monitor and the
write_monitor. The write monitor is an existing interface that has the
purpose of releasing the write permit at particular points of the write
process. We enhance it so to get a reference to an instance that tracks
the current offset inside the sstables::file_writer. This way the
backlog tracker can always know for sure what's the offset of the
current write.
A similar thing is done for reads. The data_consumer already tracks the
position of the current read, and we isolate that into a structure to
which we can get a reference. A read_monitor allows us to connect the
compaction to that reference.
Lifetime management:
====================
In general, tracking objects will be owned by their callers and passed
down as references. The compaction object will own the read monitors and
the compaction write monitors and the memtable flush write monitor will
be kept alive in a do_with block around the flush itself.
The backlog_{write,read}_progress_manager needs to be kept alive until
the SSTable is no longer in progress. For writes, that means until we
are able to add the SSTable charges in full, and for reads (compaction)
that means until we are able to remove the charges in full.
It is important to do that to avoid spikes in the graph. If we remove
the progress managers in a different operation than updating the SSTable
list we will be left in a temporary state where charges appear or
disappear abruptly, to be fixed when the final
add_sstable/remove_sstable happens. So we want those things to happen
together.
The compaction_backlog_tracker is kept alive until the strategy changes,
for example, through ALTER TABLE. Current charges are transferred to the
new strategy's compaction_backlog_tracker object when we do that. If the
type of strategy changes, the current read charges are forgotten. We can
do that because those running compaction will not really contribute to
decrease the backlog of the new compaction strategy.
Tranfer of Charges
==================
When ALTER TABLE happens, we need to transfer ongoing writes to the new
backlog manager. Ongoing reads will still be tracked by the
backlog_manager that originated them.
The rationale for that is that reads still belong to the current
compaction, with the strategy that generated them. But new Tables being
written will add to the backlog of the new strategy.
Note that ALTER TABLE operations not necessarily cause a change of
Strategy. We can be using the same strategy but just changing
properties. If that is the case, we expect no discontinuity in the
backlog graph (tested).
Resharding
==========
Resharding compactions are more complex than normal compactions because
the SSTables are created in one shard and later sent to another shard.
It is better, then, to track resharding compactions separately and let
them have their own backlog tracker, which will insert backlog in
proportion to the amount of data to be resharded.
Memtable Flush I/O Controller
=============================
With the current infrastructure it becomes trivial to add a new
controller, for either I/O or CPU. This patchset then adds an I/O
controller for memtable flushes, using the same backlog algorithm that
we already used for CPU."
* 'compaction-controller-io-v5' of github.com:glommer/scylla:
database: add a controller for I/O on memtable flushes.
document the compaction controller
compaction: adjust shares for compactions
backlog_controllers: implement generic I/O controller
factor out some of the controller code
io shares: multiply all shares by 10
compaction_strategy: implement backlog manager for the SizeTiered strategy
infrastructure for backlog estimator for compaction work.
sstables: notify about end of data component write
sstables: add read_monitor_generator
sstables: add read_monitor
sstables: enhance data consumer with a position tracker
sstables: enhance the file_writer with an offset tracker
sstables: pass references instead of pointers for write_monitor
compaction: control destruction of readers
'"The issue is triggered by compaction of sstables of level higher than 0.
The problem happens when interval map of partitioned sstable set stores
intervals such as follow:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]
When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusivess info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.
This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made."
Fixes #2908.'
* 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla:
tests: test for infinite recursion bug when doing high-level compaction
Fix potential infinite recursion when combining mutations for leveled compaction
dht: make it easier to create ring_position_view from token
dht: introduce is_min/max for ring_position