there's no way a L1 sstable will be in candidates set which was
previously built from list of L0 sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Enable pbuilder for Ubuntu/Debian to prevent build enviroment dependent issues.
Also support cross building by pbuilder.
(cross-building from Fedora 25 and Ubuntu 16.04 are tested)
closes#629
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497895661-26376-1-git-send-email-syuu@scylladb.com>
As Avi noticed, the "forwarding_tag" which was meant to be local in
streamed_mutation, became global. If another class copied the same trick,
it would share the same type instead of being distinct types as intended.
The problem is that in:
using forwarding = bool_class<class forwarding_tag>;
Apparently, the "class forwarding_tag" forward-declares a global type - it
does not create a local-scope type as intended, which the following apparently
does (even though no actual definition is given for that class):
class forwarding_tag;
using forwarding = bool_class<forwarding_tag>;
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619153933.13116-1-nyh@scylladb.com>
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
We mistakenly placed ./configure.py to dh_auto_build, but it's should place at
dh_auto_configure.
This bug causes issue #2505, since we haven't defined dh_auto_configure task yet(It seems running cmake on top of the dir is one of default behavior of dh_auto_configure).
Fixes#2505
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497866068-32097-1-git-send-email-syuu@scylladb.com>
This patch add the CMakeLists.txt file for IDEs based on cmake, like
CLion.
This file assumes the existence of a build/release/gen directory,
containing generated files.
Refs #867
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170618151333.94714-1-duarte@scylladb.com>
"This patch set ensures we quote the name of a UDT when it
contains characters that may cause parsing by the CQL parser
to fail.
Fixes#2491"
* 'cql3-quote-type/v1' of https://github.com/duarten/scylla:
cql3/util: Make maybe_quote() take argument by const reference
cql3/cql3_type: Quote UDT name if needed
schema: Lift maybe_quote() into cql3/util
Boost 1.55 (ubuntu 14) fails to compile because an iterator produce by
boost::adaptors::transformed() when std::ref to lambda is passed to
it do not match iterator concept. It cannot be default constructed
because std::reference_wrapper is not default constructable.
boost::range::min_element() never actually default construct it, but
concept is checked anyway. The patch fixes it by providing an explicit
functor that is default constructable.
Message-Id: <20170618131836.GD3944@scylladb.com>
When making the schema mutations for a view update, we should only
include the base table schema mutations (in case the target node
doesn't contain them) when the view is being directly updated. When it
is being updated as a side effect of updating the base table, then
including the base schema mutations will hide the actual changes being
performed on the base.
Fixes#2500
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2). It causes crashes
during range scans, reported by Gleb:
"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.
Backtrace:
at /home/gleb/work/seastar/seastar/core/apply.hh:36
rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
at ./seastar/core/future.hh:890
at /home/gleb/work/seastar/seastar/core/future-util.hh:119
at /home/gleb/work/seastar/seastar/core/future-util.hh:142
Add a performance test for deletion in addition to the existing update
and query tests. The deletion performance test is executed using the
'--delete' argument to perf_simple_query.
Fixes#2417.
Signed-off-by: Etienne Kruger <el@loadavg.io>
Message-Id: <20170615232500.26987-1-el@loadavg.io>
This patch ensures we properly quote a UDT name, which may contain
characters like ".", which can lead the name to be interpreted as a
keyspace qualified name when parsed by the CQL parser.
Fixes#2491
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The typo went unnoticed since the compiler picked up the global scope's
forwarding_tag. The bug made streamed_mutation::forwarding and
mutation_reader::forwarding the same type, but fortunately there were
no type mixups due to this.
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
"During read query with CL<ALL not all replicas are contacted. It is
possible for some replicas to cache less data for some CF's (for instance
because of node restart), so the replica choice may have a big impact
on request's completion latency and on amount of work it generates in
a cluster.
This patch series keep track of per CF cached hit ratio and uses this
information to choose best replicas for a request. Nodes with lower
hit ratios are still contacted in order to populate their cache, but
less frequently."
* 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev:
storage_proxy: load balance read requests according to cache hit rates
choose extra replica for speculation in filter_for_query()
consistency_level: drop filter_for_query_dc_local function
database: reset node's hit rate information on connection drop
messaging_service: connection drop notifier
Store cluster wide cache hit statistics in CF
messaging_service: return cache hit ratio as part of data read
Distribute cache temperature over gossiper.
periodically calculate avg cache hit rate between all shards
database: introduce cache_temperature class
Rename load_broadcaster.cc to misc_services.cc
storage_proxy: use db::count_local_endpoints function instead open code it
"This series reduces repair memory usage and improves repair speed."
* tag 'asias/fix-repair-2430-branch-master-v4.1' of github.com:cloudius-systems/seastar-dev:
repair: Repair on all shards
repair: Allow one stream plan in flight
Range tombstones were introduced in version 1.3 and there exists
no direct upgrade from 1.2 to vnext, so we can retire the code
enforcing backwards compatibility.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170614211654.82501-1-duarte@scylladb.com>
Currently commitlog_entry_writer constructor calculates serialized size
before it is knows if a schema should be included into the entry. The
result is never used since it is recalculated when schema information is
supplied. The patch removes needless calculation.
Message-Id: <20170614114607.GA21915@scylladb.com>
Currently each time UDT or tuple is parsed new object is created. If
those objects are used to create container type repeatedly it will cause
memory leak since container types are interned, but lookup in the
cache is done using pointer to a contained type (which will be always
different for UDT and tuples). This patches interns also UDT and tuple,
so each type the same object is parsed same pointer is also returned.
Refs #2469Fixes#2487
Message-Id: <20170612142942.GO21915@scylladb.com>
Currently, shard zero is the coordinator of the repair. All the work of
checksuming of the local node and sending of the repair checksum rpc
verb is done on shard zero only. This causes other shards being
underutilized.
With this patch, we split the ranges need to be repaired into at least
smp::count ranges, so sizeof(ranges) / smp::count will be assigned to
each shard. For exmaple, we have 8 shards and 256 ragnes, each shard
will repair 32 ranges. Each shard will repair the 32 ranges
sequencially. There will be at most 8 (smp::count) ranges of repair in
parallel.
In "repair: Use more stream_plan" (commit 2043ffc064), we
switched to do stream while doing checksum instead of do stream only
after checksum pahse is completed. We take a parallelism_semaphore
before we do checksum, if there are more than sub_ranges_to_stream
(1024) ranges, we start a stream_plan and wait for the streaming to
complete (still under the parallelism_semaphore). So at most
parallelism_semaphore (100) stream_plans can be in parallel.
The parallelism_semaphore limits the parallelism of both checksum and the
streaming plan. However, it is not necessary to have the same
parallelism for both checksum and streaming, because 1) a streaming
operation itself runs in parallel (handling ranges on all shards in
prallel, sending mutaitons in parallel) , 2) and with more streaming plan
(in worse case 100) means we can write to 100 memtables at the same time
and flush 100 memtables to disk at the same time which can take a lot of
memory.
With this patch, we only allow one stream plan in flight.
If we do two truncates in a row, the second will have neither memtable
nor sstable data. Thus we will not write/remove sstables, and thus
get no resulting truncation replay position.
Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>
This patch makes storage proxy to choose replicas to read from base on
their cache hit rates. Replicas with higher cache hit rates will see
more requests while replicas with lower hit rates will see less. Local
node has a special bonus and will get more requests even if another node
has slightly higher cache hit rate (same goes for local vs remote DC),
but after the patch it is no longer guarantied that a coordinator node
will be chosen as a replica for the read (if the feature is enabled).
Currently storage proxy has to loop over remaining replicas to search
for suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
Merge filter_for_query_dc_local() functionality into filter_for_query().
This is more efficient since filter_for_query_dc_local() partitions
endpoints into 'local' and 'remote' set but filter_for_query() already
does it for CL=LOCAL so for such queries we needlessly do it twice.
Node may go down, so after it restarts cache hit rate info will be
incorrect and it can be overwhelmed with traffic until new and
up-to-date cache hit rate arrives. Solve this by dropping node's
information on connection reset, it is more accurate than relying on
gossip which may be slow and miss reboot of a node.
When a node start it does not have any information about cache temperature
of other nodes in the cluster and it is hard (if not impossible) to make
right guess. During cluster startup all nodes have cold caches, so there
is no point to redirect reads to other nodes even though local cache it
cold, but if only that node restarted than other nodes have populated
cache and reads should be redirected.
The node will get up-to-date information about other nodes caches,
but only after receiving first reply, until then it does not have the
information to make right decisions which may cause unwanted spikes
immediately after restart. Having cache temperature in gossiper helps
to solve the problem.
This patch adds new class cache_hitrate_calculator whose responsibility
is to periodically calculate average cache hit rates between all shards
for each CF.
end_bound() returns temporary object (end_bound_ref), so it cannot be
taken by reference here and used later. Copy instead.
Message-Id: <20170612132328.GJ21915@scylladb.com>
Apparently some GDB versions (7.11.1-86.fc24) don't parse double '>'
in a type name, so this:
std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family>>
should be this:
std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family> >
Message-Id: <1497256644-4335-1-git-send-email-tgrabiec@scylladb.com>
The problem is that 'key' is a 'bytes' object now, which doesn't have __format__.
Fixes the following error:
Traceback (most recent call last):
File "~/src/scylla/scylla-gdb.py", line 184, in invoke
TypeError: non-empty format string passed to object.__format__
Error occurred in Python command: non-empty format string passed to object.__format__
Message-Id: <1497253433-374-2-git-send-email-tgrabiec@scylladb.com>
"This series switches repair to use more stream plans to stream the mismatched
sub ranges and use a range generator to produce sub ranges.
Test shows no huge memory is used for repair with large data set.
In addition, we now have a progress reporter in the log how many ranges are processed.
Jun 06 14:18:22 [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Jun 06 14:19:55 [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Fixes #2430."
* tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev:
repair: Remove unused sub_ranges_max
repair: Reduce parallelism in repair_ranges
repair: Tweak the log a bit
repair: Use more stream_plan
repair: iterator over subranges instead of list