Compare commits

...

47 Commits

Author SHA1 Message Date
Pekka Enberg
7e1b245887 release: prepare for 1.6.0 2017-02-01 13:58:06 +02:00
Takuya ASADA
83fc7de65f dist: add lspci to dependencies, since it is used by dpdk-devbind.py
On a minimal setup environment scylla_sysconfig_setup will fail because the lspci command is not installed, so install it at package installation time.

Fixes #2035

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485327435-20543-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bce0fb3fa2)
2017-01-25 17:37:18 +02:00
Pekka Enberg
d4781f2de3 release: prepare for 1.6.rc2 2017-01-24 14:33:12 +02:00
Amos Kong
2193a83a82 dist/redhat: fix path of housekeeping.cfg
scylla-housekeeping[3857]: Config file /etc/scylla.d/housekeeping.cfg is missing, terminating

Housekeeping failed to execute because the config file is missing;
the config file should be in /etc/scylla.d/.

Fixes #2020

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <e63f2f8cb94410a6dca4e6193932f0079755ad47.1484724328.git.amos@scylladb.com>
(cherry picked from commit b880bdccef)
2017-01-19 11:09:41 +02:00
Pekka Enberg
78d74bf23a dist/docker: Use Scylla 1.6 RPM repository 2017-01-18 11:58:42 +02:00
Pekka Enberg
642a479c73 release: prepare for 1.6.rc1 2017-01-16 19:02:05 +02:00
Tomasz Grabiec
8a6d0ad2fa storage_proxy: Fix capturing of on-stack variable by reference
partition_range_count was accepted by the do_with callback by value and
then captured by reference by async code, thus causing a use-after-destroy.

Message-Id: <1484317846-14485-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3c3a4358ae)
2017-01-16 11:49:34 +02:00
Tomasz Grabiec
37f73781ee storage_proxy: Add missing initialization of _short_read_allowed
Dropped by a1cafed370 ("storage_proxy:
handle range scans of sparsely populated tables").

Fixes the failure in update_cluster_layout_tests.TestUpdateClusterLayout test.

Message-Id: <1484317450-13525-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 66547e7d7c)
2017-01-13 16:49:30 +02:00
Takuya ASADA
6fd5442fb7 scylla-housekeeping: move uuid file to /var/lib/scylla-housekeeping
Since scylla-housekeeping runs as the scylla user, it doesn't have permission
to create a file in /etc/scylla.d.
So introduce /var/lib/scylla-housekeeping, which is owned by the scylla user,
and place the uuid file in that directory.

Fixes #2009

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484235946-12463-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bee7f549a9)
2017-01-13 16:28:36 +02:00
Avi Kivity
e2777e508c Update seastar submodule
* seastar 2909f6c...0bfd7fe (1):
  > io_queue: remove owner number from metric name
2017-01-12 16:46:25 +02:00
Tomasz Grabiec
8cf7bbf208 storage_proxy: Fix use-after-free on one_or_two_partition_ranges
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until
operation resolves. However, we were passing an expiring value to it, the
result of unwrap().

Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test

Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.

Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1e8151b4f2)
2017-01-12 15:12:06 +02:00
Vlad Zolotarov
4b5742d3a6 tracing::trace_keyspace_helper: use generate_legacy_id() for CF IDs generation
Explicitly generate the IDs of tables from the system_traces KS using
generate_legacy_id() in order to ensure all nodes create these tables with
the same IDs.

This is going to prevent hitting issue #420.

Fixes #1976

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1484153725-31030-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit ca0a0f1458)
2017-01-12 11:36:52 +02:00
Avi Kivity
cf27d44412 config: disable new sharding algorithm
It still has problems:
 - while resharding a very large leveled compaction strategy table, a huge
   number of tiny sstables is generated, overwhelming the file descriptor
   limits
 - there is a large impact on read latency while resharding is going on
2017-01-12 10:15:51 +02:00
Avi Kivity
7b40e19561 Update seastar submodule
* seastar 0b49f28...2909f6c (2):
  > file/dup: don't decrease refcnt twice when file is explicitly closed
  > file: add dup() support

Preparing for file descriptor reduction during resharding backport.
2017-01-10 16:59:26 +02:00
Duarte Nunes
210e66b2b8 query_pagers: Fix over-counting of rows
This patch fixes a regression introduced in 0518895, where we counted
one extra row per partition when it contained live, non static rows.

We also simplify the visitor logic further, since now we don't need to
count rows one by one. Also remove a bunch of unused fields.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482234083-2447-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit d7e607ff51)
2017-01-10 16:54:09 +02:00
Amnon Heiman
6af51c1b1d scylla_setup: remove the uuid file creation
Scylla housekeeping can create a uuid file if it is missing. There is no
longer a need to create one for it.

Fixes #2004

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-3-git-send-email-amnon@scylladb.com>
(cherry picked from commit 8cd3d7445c)
2017-01-09 17:00:49 +02:00
Amnon Heiman
05f2bf5bd5 scylla-housekeeping: Create a uuid file if one is missing
This patch gets housekeeping to create a uuid file if a path to a uuid
file is supplied but the file is missing.

Because it imports the uuid lib, the uuid parameters were renamed.

Fixes #1987

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-2-git-send-email-amnon@scylladb.com>
(cherry picked from commit 32888fc0aa)
2017-01-09 15:27:04 +02:00
Avi Kivity
8002326f80 storage_proxy: prevent short read due to buffer size limit from being swallowed during range scan
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.

Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.

Fixes #2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>

(cherry picked from commit 8f36dca6f1)
2017-01-09 09:27:56 +00:00
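The fix above can be sketched as a toy model of the merger's short-read flag (the `partials` vector and `accounter_stopped` parameter are illustrative stand-ins, not Scylla's actual mutation_result_merger API):

```cpp
#include <vector>

// Buggy variant: the flag starts from `no`, so it ignores the memory
// accounter having stopped the scan early.
bool merged_short_read_buggy(const std::vector<bool>& partials, bool accounter_stopped) {
    (void)accounter_stopped;      // the bug: this input is never consulted
    bool short_read = false;
    for (bool s : partials) {
        short_read = short_read || s;
    }
    return short_read;
}

// Fixed variant: the flag starts from the accounter's own view
// (this->short_read() in the real code), then merges the partials.
bool merged_short_read_fixed(const std::vector<bool>& partials, bool accounter_stopped) {
    bool short_read = accounter_stopped;
    for (bool s : partials) {
        short_read = short_read || s;
    }
    return short_read;
}
```

With all partial results full but the accounter having stopped early, only the fixed variant reports a short read.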
Tomasz Grabiec
a00a1a1044 db: Make system tables use the commitlog
Before this patch system table writes were not writing to commit log
because database::add_column_family() disables writes to commit log
for the table which is added if _commitlog is not set at that
time. Fix by initializing commit log before system tables are created.

Fixes #1986.

Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.

Could cause system keyspace writes to be delayed longer than before
under a heavy write workload. Refs #1926.

Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit cd630fece6)
2017-01-05 14:54:11 +02:00
Avi Kivity
337d6fb2cf storage_proxy: fix result ordering for parallel partition range scans
During a range scan, we try to avoid sorting according to partition range
when we can do so.  This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.

However, we use the wrong key for the sort -- we use the shard number.  But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.

Fix by storing the iteration order as the sort key.  We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.

Fixes #1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>

(cherry picked from commit eb520e7352)
2017-01-05 12:56:33 +01:00
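The wrap-around bug above can be illustrated with a toy model (field names are illustrative, not Scylla's): for a scan that starts at shard 2 of 4 and wraps around, sorting partial results by shard number puts shard 0's range first, while sorting by iteration order preserves the scan order.

```cpp
#include <algorithm>
#include <vector>

// Toy partial result of a range scan: `shard` is where it ran, `order` is
// the position in the iteration that visited the shards.
struct partial_result {
    unsigned shard;
    unsigned order;
};

// Returns the shard of the first merged result for a scan that started at
// shard 2 of 4 and wrapped around, visiting shards 2, 3, 0, 1.
unsigned first_shard(bool sort_by_iteration_order) {
    std::vector<partial_result> results = {{2, 0}, {3, 1}, {0, 2}, {1, 3}};
    std::sort(results.begin(), results.end(),
              [=](const partial_result& a, const partial_result& b) {
                  return sort_by_iteration_order ? a.order < b.order
                                                 : a.shard < b.shard;
              });
    return results.front().shard;
}
```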
Avi Kivity
1a8211e573 result_memory_tracker: fix too-short short reads
1.6 truncates paged queries early to avoid overrunning server memory
with too-large query results, but in the case of partition range queries,
this terminates too early due to an uninitialized variable holding the
maximum result size.  This results in slow performance due to additional
round trips.

Fix by initializing the maximum result size from the result_memory_tracker
running on the coordinating shard.

Fixes #1995.
Message-Id: <20170105103915.10633-1-avi@scylladb.com>

(cherry picked from commit 4667641f5f)
2017-01-05 11:04:03 +00:00
Takuya ASADA
a1a6c10964 dist/redhat: add python-setuptools as a dependency since it is required by scylla-housekeeping
scylla-housekeeping breaks when python-setuptools isn't installed, so
add it as a dependency.

Fixes #1884

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483525828-7507-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 43655512e1)
2017-01-04 14:32:52 +02:00
Avi Kivity
c692824786 Update seastar submodule
* seastar 1e45fc8...0b49f28 (2):
  > metrics: Metrics function should take variable as a reference
  > collectd: create metrics with the right format
2017-01-04 12:41:33 +02:00
Gleb Natapov
0ccdbbf1af storage_proxy: do not deref unengaged stdx:optional
Fixes intentional short reads.

Message-Id: <20161227142133.GE1829@scylladb.com>
(cherry picked from commit 4ca58959ad)
2017-01-01 12:16:45 +02:00
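The class of bug fixed above — dereferencing a disengaged optional, which is undefined behaviour — can be sketched minimally (the real code uses stdx::optional; the function name here is made up):

```cpp
#include <optional>

// Guard before dereferencing: a disengaged optional must not be
// dereferenced, so fall back to a default instead.
unsigned live_rows_or_zero(const std::optional<unsigned>& live_rows) {
    return live_rows ? *live_rows : 0;
}
```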
Takuya ASADA
138ad64cbc dist/ubuntu: check for lsb_release existence since it's not included in a minimal Debian installation
Ubuntu has it in a minimal installation but Debian doesn't, so add a check for it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483003565-2753-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit e48cc9cf01)
2016-12-29 16:40:20 +02:00
Pekka Enberg
01737a51a9 tracing: Add seastar/core/scollectd.hh include
Fix the following build breakage:

FAILED: build/release/gen/cql3/CqlParser.o
g++ -MMD -MT build/release/gen/cql3/CqlParser.o -MF build/release/gen/cql3/CqlParser.o.d -std=gnu++1y -g  -Wall -Werror -fvisibility=hidden -pthread -I/home/penberg/scylla/seastar -I/home/penberg/scylla/seastar/fmt -I/home/penberg/scylla/seastar/build/release/gen  -march=nehalem -Ifmt -DBOOST_TEST_DYN_LINK -Wno-overloaded-virtual -DFMT_HEADER_ONLY -DHAVE_HWLOC -DHAVE_NUMA -DHAVE_LZ4_COMPRESS_DEFAULT  -O2 -DBOOST_TEST_DYN_LINK  -Wno-maybe-uninitialized -DHAVE_LIBSYSTEMD=1 -I. -I build/release/gen -I seastar -I seastar/build/release/gen -c -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
In file included from ./query-request.hh:31:0,
                 from ./locator/token_metadata.hh:51,
                 from ./locator/abstract_replication_strategy.hh:29,
                 from ./database.hh:26,
                 from ./service/storage_proxy.hh:44,
                 from ./db/schema_tables.hh:43,
                 from ./db/system_keyspace.hh:46,
                 from ./cql3/functions/function_name.hh:45,
                 from ./cql3/selection/selectable.hh:48,
                 from ./cql3/selection/writetime_or_ttl.hh:45,
                 from build/release/gen/cql3/CqlParser.hpp:63,
                 from build/release/gen/cql3/CqlParser.cpp:44:
./tracing/tracing.hh:357:5: error: ‘scollectd’ does not name a type
     scollectd::registrations _registrations;
     ^~~~~~~~~

Message-Id: <1482939751-8756-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit a443dfa95e)
2016-12-28 19:16:30 +02:00
Pekka Enberg
9753a39284 Update seastar submodule
* seastar 0b98024...1e45fc8 (1):
  > Merge "migrate network related seastar collectd metrics to the new metrics registration API" from Vlad
2016-12-28 17:06:06 +02:00
Avi Kivity
42a76567b7 dht: use nonwrapping_ranges in ring_position_range_sharder
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges.  Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>

(cherry picked from commit 8686a59ea5)
2016-12-27 19:16:26 +02:00
Avi Kivity
725949e8bf Merge "Fixes for intentional short reads" from Paweł
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.

I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.

It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."

* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
  query: introduce result_memory_accounter::foreign_state
  storage_proxy: fix short reads in parallel range queries
  storage_proxy: pass maximum result size to replicas
  mutation_partition: use result limiter for digest reads
  query: make result_memory_limiter constants available for linker
  result_memory_limiter: add accounter for digest reads
  idl: allow writers to use any output stream
  result_memory_limiter: split new_read() to new_{data, mutation}_read()
  idl: is_short_read() was added in 1.6
  mutation_partition: honour allowed_short_read for static rows
  storage_proxy: fix _is_short_read computation
  storage_proxy: disallow short reads if got no live rows
  storage_proxy: don't stop after result with no live rows

(cherry picked from commit 868b4d110c)
2016-12-27 19:16:15 +02:00
Amnon Heiman
231cf22c0e Set the prometheus prefix to scylla
This patch make the prometheus prefix configurable and set the default
value to scylla.

Fixes #1964

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>
(cherry picked from commit 70b2a1bfd4)
2016-12-27 17:57:08 +02:00
Takuya ASADA
ef0ffd1cbb dist/common/scripts/scylla_setup: improve the message of disk selection prompt
To avoid confusing users, describe that we only list unmounted disks.

Fixes #1841

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479720708-6021-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 7c3b98806d)
2016-12-27 17:56:11 +02:00
Gleb Natapov
2b67c65eb6 messaging_service: move MUTATION_DONE messages to separate connection
If a node gets more MUTATION requests than it can handle via RPC, it will
stop reading from that RPC connection, but this will prevent it from
getting MUTATION_DONE responses for requests it coordinates, because
currently MUTATION and MUTATION_DONE messages share the same connection.

To solve this problem, this patch moves MUTATION_DONE messages to a
separate connection.

Fixes: #1843

Message-Id: <20161201155942.GC11581@scylladb.com>
(cherry picked from commit 0a2dd39c75)
2016-12-27 17:55:01 +02:00
Piotr Jastrzebski
5b0971f82f mutation_partition: don't use unique_ptr to manage LSA objects
Unique_ptr won't destruct them correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>
(cherry picked from commit 3e502de153)
2016-12-27 17:54:45 +02:00
Raphael S. Carvalho
da843239bf sstables: fix calculation of memory footprint for summary
The size of keys wasn't taken into account, so the value reported
via collectd was much smaller than the actual footprint.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>
(cherry picked from commit e28537b56f)
2016-12-27 17:54:35 +02:00
Avi Kivity
5c96b04f4d Revert "config, dht: reduce default msb ignore bits to 4"
This reverts commit b81a57e8eb.

With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.

(cherry picked from commit 3989e4ed15)
2016-12-27 17:54:09 +02:00
Avi Kivity
a1d463900f storage_proxy: handle range scans of sparsely populated tables
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard.  With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.

Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).

If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatenation.  This
optimizes for the dense(r) case, where few partial results are required.

If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys.  This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.

We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.

[tgrabiec: trivial conflicts]

Message-Id: <20161220170532.25173-1-avi@scylladb.com>
(cherry picked from commit a1cafed370)
2016-12-27 16:57:18 +02:00
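The exponential retry mechanism above can be sketched by its growth pattern alone — a toy model of how many rounds are needed when the batch of queried subranges doubles each time (the real code also merges partial results per round):

```cpp
// Number of retry iterations needed to cover `subranges` subranges when the
// batch size starts at 1 and doubles each round: O(log n) rounds instead of
// the O(n) of querying subranges one at a time.
unsigned rounds_needed(unsigned subranges) {
    unsigned rounds = 0;
    unsigned covered = 0;
    unsigned batch = 1;
    while (covered < subranges) {
        covered += batch;
        batch *= 2;
        ++rounds;
    }
    return rounds;
}
```

For example, covering 4096 subranges takes 13 rounds (1 + 2 + 4 + ... doubling), versus 4096 sequential queries before the change.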
Avi Kivity
66598a68d5 tests: adjust mutation_query_test for partition and row limits
Won't build otherwise.

(cherry picked from commit b740aff777)
2016-12-27 16:53:35 +02:00
Avi Kivity
8ad0e96025 Merge "storage_proxy: Enforce row limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. Uppers layers like the pager may
already be depending on this behavior."

* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
  query_pagers: Don't trim returned rows
  select_statement: Don't always trim result set
  query_result_merger: Limit rows
  mutation_query: to_data_query_result enforces row limit

(cherry picked from commit 3421ebe8be)
2016-12-27 16:36:41 +02:00
Avi Kivity
1a2a63787a Merge "storage_proxy: Enforce partition limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."

* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
  storage_proxy: Decrease limits when retrying command
  storage_proxy: Don't fetch superfluous partitions
  query::result: Add partition count
  column_family: Use counters in query::result::builder
  query_result_builder: Use the underlying counters
  mutation_partition: Count partitions in query_compacted
  mutation_partition: Remove tabs in query_compacted
  query::result::builder: Add partition count
  query_result_merger: Limit partitions

(cherry picked from commit 6bb875bdb7)
2016-12-27 16:36:27 +02:00
Avi Kivity
52e5706147 Point seastar submodule at scylla-seastar.git
Allow separate management of Scylla 1.6's version of seastar.
2016-12-27 16:33:22 +02:00
Pekka Enberg
cec07ea366 release: prepare for 1.6.rc0 2016-12-27 12:41:34 +02:00
Benoît Canet
cbe729415e scylla_setup: Use blkid or ls to list potentials block devices
blkid does not list the root raw device.

Revert to lsblk while taking care of having a fallback
path in case the -p option is not supported.

Fixes #1963.

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20161225100204.13297-1-benoit@scylladb.com>
(cherry picked from commit a24ff47c63)
2016-12-27 12:34:56 +02:00
Raphael S. Carvalho
b5190f9971 db: avoid excessive disk usage during sstable resharding
Shared sstables will now be resharded in the same order to guarantee
that all shards owning an sstable will agree on its deletion at nearly
the same time, thus reducing the disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.

Fixes #1952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
(cherry picked from commit 27fb8ec512)
2016-12-27 12:19:25 +02:00
Tomasz Grabiec
7739456ec2 sstables: Fix double close on index and data files when writing fails
File output streams take responsibility for closing the file; they
will close the file as part of closing the stream.

During sstable writing we create an sstable object and keep file
references there as well. The sstable object also has responsibility for
closing the files, and does so from sstable::~sstable().

Double close was supposed to be avoided by a construct like this:

  writer.close().get();
  _file = {};

However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.

Another problem is that if an exception happened before we reached that
construct, we should still close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed, and once it's fixed a double close on _file will be even more
likely.

The fix employed here is to not keep files inside the sstable object when
writing. As soon as the writer is constructed, it's the sole owner of
the file.

Fixes #1764.

Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f2a63270d1)
2016-12-27 11:11:51 +02:00
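The single-owner fix above can be sketched as a toy model (the types and names here are illustrative, not Scylla's): once the writer is constructed it is the sole owner of the file, so the file is closed exactly once even when writing fails.

```cpp
#include <memory>

// Toy file handle that counts how many times it is closed.
struct toy_file {
    int* close_count = nullptr;
    void close() { ++*close_count; }
};

// Toy writer: sole owner of the file from construction onward, so its
// destructor is the only place the file gets closed.
struct toy_writer {
    std::unique_ptr<toy_file> file;
    ~toy_writer() { if (file) { file->close(); } }
};

// Simulates a write (possibly failing mid-way) and returns how many times
// the file was closed.
int closes_after_write(bool write_fails) {
    int closes = 0;
    {
        auto f = std::make_unique<toy_file>();
        f->close_count = &closes;
        toy_writer w{std::move(f)};  // the sstable object no longer holds f
        if (write_fails) {
            // error path: no second owner exists, so no double close
        }
    }
    return closes;
}
```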
Takuya ASADA
e4da0167d4 dist/redhat: don't try to add a user when the user already exists
Currently we get "failed adding user 'scylla'" on .rpm installation when the user already exists; we can skip it to prevent the error.

Fixes #1958

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1482550075-27939-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit f3e45bc9ef)
2016-12-27 09:47:52 +02:00
Vlad Zolotarov
a3266060e3 tracing: don't start tracing until a Tracing service is fully initialized
RPC messaging service is initialized before the Tracing service, so
we should prevent creation of tracing spans before the service is
fully initialized.

We will use an already existing "_down" state and extend it in a way
that !_down equals "started", where "started" is TRUE when the local
service is fully initialized.

We will also split the Tracing service initialization into two parts:
   1) Initialize the sharded object.
   2) Start the tracing service:
      - Create the I/O backend service.
      - Enable tracing.

Fixes issue #1939

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 62cad0f5f5)
2016-12-21 12:49:41 +02:00
Glauber Costa
f6c83f73ef track streaming and system virtual dirty memory
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.

After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
(cherry picked from commit 7133583797)
2016-12-21 12:44:30 +02:00
49 changed files with 891 additions and 372 deletions

.gitmodules vendored

@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
-	url = ../seastar
+	url = ../scylla-seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui


@@ -1,6 +1,6 @@
 #!/bin/sh
-VERSION=666.development
+VERSION=1.6.0
 if test -f version
 then


@@ -44,7 +44,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
     mutation_partition_serializer part_ser(*m.schema(), m.partition());
     bytes_ostream out;
-    ser::writer_of_canonical_mutation wr(out);
+    ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
     std::move(wr).write_table_id(m.schema()->id())
         .write_schema_version(m.schema()->version())
         .write_key(m.key())


@@ -307,7 +307,8 @@ select_statement::execute(distributed<service::storage_proxy>& proxy,
     // doing post-query ordering.
     if (needs_post_query_ordering() && _limit) {
         return do_with(std::forward<std::vector<query::partition_range>>(partition_ranges), [this, &proxy, &state, &options, cmd](auto prs) {
-            query::result_merger merger;
+            assert(cmd->partition_limit == query::max_partitions);
+            query::result_merger merger(cmd->row_limit * prs.size(), query::max_partitions);
             return map_reduce(prs.begin(), prs.end(), [this, &proxy, &state, &options, cmd] (auto pr) {
                 std::vector<query::partition_range> prange { pr };
                 auto command = ::make_lw_shared<query::read_command>(*cmd);
@@ -341,7 +342,8 @@ select_statement::execute_internal(distributed<service::storage_proxy>& proxy,
     if (needs_post_query_ordering() && _limit) {
         return do_with(std::move(partition_ranges), [this, &proxy, &state, command] (auto prs) {
-            query::result_merger merger;
+            assert(command->partition_limit == query::max_partitions);
+            query::result_merger merger(command->row_limit * prs.size(), query::max_partitions);
             return map_reduce(prs.begin(), prs.end(), [this, &proxy, &state, command] (auto pr) {
                 std::vector<query::partition_range> prange { pr };
                 auto cmd = ::make_lw_shared<query::read_command>(*command);
@@ -375,8 +377,8 @@ select_statement::process_results(foreign_ptr<lw_shared_ptr<query::result>> resu
         if (_is_reversed) {
             rs->reverse();
         }
-        rs->trim(cmd->row_limit);
     }
+    rs->trim(cmd->row_limit);
     return ::make_shared<transport::messages::result_message::rows>(std::move(rs));
 }


@@ -816,6 +816,12 @@ void column_family::load_sstable(sstables::shared_sstable& sst, bool reset_level
 // several shards, but we can't start any compaction before all the sstables
 // of this CF were loaded. So call this function to start rewrites, if any.
 void column_family::start_rewrite() {
+    // submit shared sstables in generation order to guarantee that all shards
+    // owning a sstable will agree on its deletion nearly the same time,
+    // therefore, reducing disk space requirements.
+    boost::sort(_sstables_need_rewrite, [] (const sstables::shared_sstable& x, const sstables::shared_sstable& y) {
+        return x->generation() < y->generation();
+    });
     for (auto sst : _sstables_need_rewrite) {
         dblog.info("Splitting {} for shard", sst->get_filename());
         _compaction_manager.submit_sstable_rewrite(this, sst);
@@ -1670,14 +1676,40 @@ database::database(const db::config& cfg)
     dblog.info("Row: max_vector_size: {}, internal_count: {}", size_t(row::max_vector_size), size_t(row::internal_count));
 }
+void
+dirty_memory_manager::setup_collectd(sstring namestr) {
+    _collectd.push_back(
+        scollectd::add_polled_metric(scollectd::type_instance_id("memory"
+                , scollectd::per_cpu_plugin_instance
+                , "bytes", namestr + "_dirty")
+                , scollectd::make_typed(scollectd::data_type::GAUGE, [this] {
+            return real_dirty_memory();
+    })));
+    _collectd.push_back(
+        scollectd::add_polled_metric(scollectd::type_instance_id("memory"
+                , scollectd::per_cpu_plugin_instance
+                , "bytes", namestr + "_virtual_dirty")
+                , scollectd::make_typed(scollectd::data_type::GAUGE, [this] {
+            return virtual_dirty_memory();
+    })));
+}
 void
 database::setup_collectd() {
+    _dirty_memory_manager.setup_collectd("regular");
+    _system_dirty_memory_manager.setup_collectd("system");
+    _streaming_dirty_memory_manager.setup_collectd("streaming");
     _collectd.push_back(
         scollectd::add_polled_metric(scollectd::type_instance_id("memory"
                 , scollectd::per_cpu_plugin_instance
                 , "bytes", "dirty")
                 , scollectd::make_typed(scollectd::data_type::GAUGE, [this] {
-            return _dirty_memory_manager.real_dirty_memory();
+            return _dirty_memory_manager.real_dirty_memory() +
+                   _system_dirty_memory_manager.real_dirty_memory() +
+                   _streaming_dirty_memory_manager.real_dirty_memory();
     })));
     _collectd.push_back(
@@ -1685,7 +1717,9 @@ database::setup_collectd() {
                 , scollectd::per_cpu_plugin_instance
                 , "bytes", "virtual_dirty")
                 , scollectd::make_typed(scollectd::data_type::GAUGE, [this] {
-            return _dirty_memory_manager.virtual_dirty_memory();
+            return _dirty_memory_manager.virtual_dirty_memory() +
+                   _system_dirty_memory_manager.virtual_dirty_memory() +
+                   _streaming_dirty_memory_manager.virtual_dirty_memory();
     })));
     _collectd.push_back(
@@ -1956,13 +1990,12 @@ future<> database::parse_system_tables(distributed<service::storage_proxy>& prox
 future<>
 database::init_system_keyspace() {
-    bool durable = _cfg->data_file_directories().size() > 0;
-    db::system_keyspace::make(*this, durable, _cfg->volatile_system_keyspace_for_testing());
-    // FIXME support multiple directories
-    return io_check(touch_directory, _cfg->data_file_directories()[0] + "/" + db::system_keyspace::NAME).then([this] {
-        return populate_keyspace(_cfg->data_file_directories()[0], db::system_keyspace::NAME).then([this]() {
-            return init_commitlog();
+    return init_commitlog().then([this] {
+        bool durable = _cfg->data_file_directories().size() > 0;
+        db::system_keyspace::make(*this, durable, _cfg->volatile_system_keyspace_for_testing());
+        // FIXME support multiple directories
+        return io_check(touch_directory, _cfg->data_file_directories()[0] + "/" + db::system_keyspace::NAME).then([this] {
+            return populate_keyspace(_cfg->data_file_directories()[0], db::system_keyspace::NAME);
         });
     }).then([this] {
         auto& ks = find_keyspace(db::system_keyspace::NAME);
@@ -2382,29 +2415,33 @@ struct query_state {
     std::vector<query::partition_range>::const_iterator current_partition_range;
     std::vector<query::partition_range>::const_iterator range_end;
     mutation_reader reader;
+    uint32_t remaining_rows() const {
+        return limit - builder.row_count();
+    }
+    uint32_t remaining_partitions() const {
+        return partition_limit - builder.partition_count();
+    }
     bool done() const {
-        return !limit || !partition_limit || current_partition_range == range_end || builder.is_short_read();
+        return !remaining_rows() || !remaining_partitions() || current_partition_range == range_end || builder.is_short_read();
     }
 };
 future<lw_shared_ptr<query::result>>
 column_family::query(schema_ptr s, const query::read_command& cmd, query::result_request request,
                      const std::vector<query::partition_range>& partition_ranges,
-                     tracing::trace_state_ptr trace_state, query::result_memory_limiter& memory_limiter) {
+                     tracing::trace_state_ptr trace_state, query::result_memory_limiter& memory_limiter,
+                     uint64_t max_size) {
     utils::latency_counter lc;
     _stats.reads.set_latency(lc);
     auto f = request == query::result_request::only_digest
-             ? make_ready_future<query::result_memory_accounter>() : memory_limiter.new_read();
+             ? memory_limiter.new_digest_read(max_size) : memory_limiter.new_data_read(max_size);
     return f.then([this, lc, s = std::move(s), &cmd, request, &partition_ranges, trace_state = std::move(trace_state)] (query::result_memory_accounter accounter) mutable {
         auto qs_ptr = std::make_unique<query_state>(std::move(s), cmd, request, partition_ranges, std::move(accounter));
         auto& qs = *qs_ptr;
         return do_until(std::bind(&query_state::done, &qs), [this, &qs, trace_state = std::move(trace_state)] {
             auto&& range = *qs.current_partition_range++;
-            return data_query(qs.schema, as_mutation_source(trace_state), range, qs.cmd.slice, qs.limit, qs.partition_limit,
-                    qs.cmd.timestamp, qs.builder).then([&qs] (auto&& r) {
-                qs.limit -= r.live_rows;
-                qs.partition_limit -= r.partitions;
-            });
+            return data_query(qs.schema, as_mutation_source(trace_state), range, qs.cmd.slice, qs.remaining_rows(),
+                    qs.remaining_partitions(), qs.cmd.timestamp, qs.builder);
         }).then([qs_ptr = std::move(qs_ptr), &qs] {
             return make_ready_future<lw_shared_ptr<query::result>>(
                     make_lw_shared<query::result>(qs.builder.build()));
@@ -2428,9 +2465,10 @@ column_family::as_mutation_source(tracing::trace_state_ptr trace_state) const {
 }
 future<lw_shared_ptr<query::result>>
-database::query(schema_ptr s, const query::read_command& cmd, query::result_request request, const std::vector<query::partition_range>& ranges, tracing::trace_state_ptr trace_state) {
+database::query(schema_ptr s, const query::read_command& cmd, query::result_request request, const std::vector<query::partition_range>& ranges, tracing::trace_state_ptr trace_state,
+        uint64_t max_result_size) {
     column_family& cf = find_column_family(cmd.cf_id);
-    return cf.query(std::move(s), cmd, request, ranges, std::move(trace_state), get_result_memory_limiter()).then_wrapped([this, s = _stats] (auto f) {
+    return cf.query(std::move(s), cmd, request, ranges, std::move(trace_state), get_result_memory_limiter(), max_result_size).then_wrapped([this, s = _stats] (auto f) {
         if (f.failed()) {
             ++s->total_reads_failed;
         } else {

@@ -149,7 +149,11 @@ class dirty_memory_manager: public logalloc::region_group_reclaimer {
bool has_pressure() const {
return over_soft_limit();
}
std::vector<scollectd::registration> _collectd;
public:
void setup_collectd(sstring namestr);
future<> shutdown();
// Limits and pressure conditions:
@@ -650,7 +654,8 @@ public:
const query::read_command& cmd, query::result_request request,
const std::vector<query::partition_range>& ranges,
tracing::trace_state_ptr trace_state,
query::result_memory_limiter& memory_limiter);
query::result_memory_limiter& memory_limiter,
uint64_t max_result_size);
future<> populate(sstring datadir);
@@ -1162,7 +1167,8 @@ public:
unsigned shard_of(const dht::token& t);
unsigned shard_of(const mutation& m);
unsigned shard_of(const frozen_mutation& m);
future<lw_shared_ptr<query::result>> query(schema_ptr, const query::read_command& cmd, query::result_request request, const std::vector<query::partition_range>& ranges, tracing::trace_state_ptr trace_state);
future<lw_shared_ptr<query::result>> query(schema_ptr, const query::read_command& cmd, query::result_request request, const std::vector<query::partition_range>& ranges,
tracing::trace_state_ptr trace_state, uint64_t max_result_size);
future<reconcilable_result> query_mutations(schema_ptr, const query::read_command& cmd, const query::partition_range& range,
query::result_memory_accounter&& accounter, tracing::trace_state_ptr trace_state);
// Apply the mutation atomically.


@@ -736,8 +736,9 @@ public:
val(lsa_reclamation_step, size_t, 1, Used, "Minimum number of segments to reclaim in a single step") \
val(prometheus_port, uint16_t, 9180, Used, "Prometheus port, set to zero to disable") \
val(prometheus_address, sstring, "0.0.0.0", Used, "Prometheus listening address") \
val(prometheus_prefix, sstring, "scylla", Used, "Set the prefix of the exported Prometheus metrics. Changing this will break Scylla's dashboard compatibility, do not change unless you know what you are doing.") \
val(abort_on_lsa_bad_alloc, bool, false, Used, "Abort when allocation in LSA region fails") \
val(murmur3_partitioner_ignore_msb_bits, unsigned, 4, Used, "Number of most significant token bits to ignore in murmur3 partitioner; increase for very large clusters") \
val(murmur3_partitioner_ignore_msb_bits, unsigned, 0, Used, "Number of most significant token bits to ignore in murmur3 partitioner; increase for very large clusters") \
/* done! */
#define _make_value_member(name, type, deflt, status, desc, ...) \


@@ -276,15 +276,19 @@ ring_position_range_vector_sharder::ring_position_range_vector_sharder(std::vect
next_range();
}
stdx::optional<ring_position_range_and_shard>
stdx::optional<ring_position_range_and_shard_and_element>
ring_position_range_vector_sharder::next(const schema& s) {
if (!_current_sharder) {
return stdx::nullopt;
}
auto ret = _current_sharder->next(s);
while (!ret && _current_range != _ranges.end()) {
auto range_and_shard = _current_sharder->next(s);
while (!range_and_shard && _current_range != _ranges.end()) {
next_range();
ret = _current_sharder->next(s);
range_and_shard = _current_sharder->next(s);
}
auto ret = stdx::optional<ring_position_range_and_shard_and_element>();
if (range_and_shard) {
ret.emplace(std::move(*range_and_shard), _current_range - _ranges.begin() - 1);
}
return ret;
}
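The change above rewraps the per-range sharder result in a `ring_position_range_and_shard_and_element` carrying the vector index, where `_current_range - _ranges.begin() - 1` accounts for `next_range()` having already advanced past the element being drained. A reduced, hypothetical sketch of the same flattening-with-index pattern:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Sketch: a cursor over a vector of "ranges" that tags each yielded value
// with the index of the vector element it came from. As in the sharder,
// `current` counts elements already opened, so the element index of the
// value being produced is current - 1.
struct flattening_cursor {
    std::vector<std::vector<int>> ranges;
    std::size_t current = 0; // number of elements opened so far
    std::size_t pos = 0;     // position within the open element

    explicit flattening_cursor(std::vector<std::vector<int>> r)
        : ranges(std::move(r)) {}

    std::optional<std::pair<int, std::size_t>> next() {
        for (;;) {
            if (current > 0 && pos < ranges[current - 1].size()) {
                // element index mirrors _current_range - _ranges.begin() - 1
                return std::make_pair(ranges[current - 1][pos++], current - 1);
            }
            if (current == ranges.size()) {
                return std::nullopt; // no more elements to open
            }
            ++current; // next_range()
            pos = 0;
        }
    }
};
```

Empty elements are skipped transparently, which is why the loop in the real `next()` keeps calling `next_range()` until a shard is produced or the vector is exhausted.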


@@ -445,13 +445,20 @@ class ring_position_range_sharder {
nonwrapping_range<ring_position> _range;
bool _done = false;
public:
explicit ring_position_range_sharder(range<ring_position> rrp)
explicit ring_position_range_sharder(nonwrapping_range<ring_position> rrp)
: ring_position_range_sharder(global_partitioner(), std::move(rrp)) {}
ring_position_range_sharder(const i_partitioner& partitioner, range<ring_position> rrp)
ring_position_range_sharder(const i_partitioner& partitioner, nonwrapping_range<ring_position> rrp)
: _partitioner(partitioner), _range(std::move(rrp)) {}
stdx::optional<ring_position_range_and_shard> next(const schema& s);
};
struct ring_position_range_and_shard_and_element : ring_position_range_and_shard {
ring_position_range_and_shard_and_element(ring_position_range_and_shard&& rpras, unsigned element)
: ring_position_range_and_shard(std::move(rpras)), element(element) {
}
unsigned element;
};
class ring_position_range_vector_sharder {
using vec_type = std::vector<nonwrapping_range<ring_position>>;
vec_type _ranges;
@@ -465,7 +472,8 @@ private:
}
public:
explicit ring_position_range_vector_sharder(std::vector<nonwrapping_range<ring_position>> ranges);
stdx::optional<ring_position_range_and_shard> next(const schema& s);
// results are returned sorted by index within the vector first, then within each vector item
stdx::optional<ring_position_range_and_shard_and_element> next(const schema& s);
};
nonwrapping_range<ring_position> to_partition_range(nonwrapping_range<dht::token>);


@@ -79,8 +79,16 @@ verify_package() {
fi
}
list_block_devices() {
if lsblk --help | grep -q -e -p; then
lsblk -pnr | awk '{ print $1 }'
else
ls -1 /dev/sd* /dev/hd* /dev/xvd* /dev/nvme* /dev/mapper/* 2>/dev/null|grep -v control
fi
}
get_unused_disks() {
blkid -c /dev/null|cut -f 1 -d ' '|sed s/://g|grep -v loop|while read dev
list_block_devices|grep -v loop|while read dev
do
count_raw=$(grep $dev /proc/mounts|wc -l)
count_pvs=0
@@ -266,14 +274,9 @@ if [ $ENABLE_SERVICE -eq 1 ]; then
fi
fi
if [ ! -f /etc/scylla.d/housekeeping.uuid ]; then
uuidgen > /etc/scylla.d/housekeeping.uuid
fi
UUID=`cat /etc/scylla.d/housekeeping.uuid` || true
CUR_VERSION=`scylla --version` || true
if [ "$CUR_VERSION" != "" ] && [ "$UUID" != "" ]; then
NEW_VERSION=`/usr/lib/scylla/scylla-housekeeping --uuid $UUID version --version $CUR_VERSION --mode i` || true
NEW_VERSION=`sudo -u scylla /usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid version --version $CUR_VERSION --mode i` || true
if [ "$NEW_VERSION" != "" ]; then
echo $NEW_VERSION
fi
@@ -316,7 +319,7 @@ if [ $INTERACTIVE -eq 1 ]; then
echo
RAID_SETUP=0
else
echo "Please select disks from the following list: $DEVS"
echo "Please select unmounted disks from the following list: $DEVS"
fi
while [ "$DEVS" != "" ]; do
echo "type 'done' to finish selection. selected: $DISKS"


@@ -6,7 +6,7 @@ After=network.target
Type=simple
User=scylla
Group=scylla
ExecStart=/usr/lib/scylla/scylla-housekeeping --uuid-file /etc/scylla.d/housekeeping.uuid -q -c /etc/scylla.d/housekeeping.cfg version --mode d
ExecStart=/usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid -q -c /etc/scylla.d/housekeeping.cfg version --mode d
[Install]
WantedBy=multi-user.target


@@ -7,7 +7,7 @@ ENV container docker
VOLUME [ "/sys/fs/cgroup" ]
#install scylla
RUN curl http://downloads.scylladb.com/rpm/unstable/centos/master/latest/scylla.repo -o /etc/yum.repos.d/scylla.repo
RUN curl http://downloads.scylladb.com/rpm/centos/scylla-1.6.repo -o /etc/yum.repos.d/scylla.repo
RUN yum -y install epel-release
RUN yum -y clean expire-cache
RUN yum -y update


@@ -30,7 +30,7 @@ URL: http://www.scylladb.com/
BuildRequires: libaio-devel libstdc++-devel cryptopp-devel hwloc-devel numactl-devel libpciaccess-devel libxml2-devel zlib-devel thrift-devel yaml-cpp-devel lz4-devel snappy-devel jsoncpp-devel systemd-devel xz-devel pcre-devel elfutils-libelf-devel bzip2-devel keyutils-libs-devel xfsprogs-devel make gnutls-devel systemd-devel lksctp-tools-devel protobuf-devel protobuf-compiler libunwind-devel systemtap-sdt-devel
%{?fedora:BuildRequires: boost-devel ninja-build ragel antlr3-tool antlr3-C++-devel python3 gcc-c++ libasan libubsan python3-pyparsing dnf-yum}
%{?rhel:BuildRequires: scylla-libstdc++-static scylla-boost-devel scylla-boost-static scylla-ninja-build scylla-ragel scylla-antlr3-tool scylla-antlr3-C++-devel python34 scylla-gcc-c++ >= 5.1.1, python34-pyparsing}
Requires: scylla-conf systemd-libs hwloc collectd PyYAML python-urwid pciutils pyparsing python-requests curl util-linux
Requires: scylla-conf systemd-libs hwloc collectd PyYAML python-urwid pciutils pyparsing python-requests curl util-linux python-setuptools pciutils
%{?rhel:Requires: python34 python34-PyYAML}
Conflicts: abrt
@@ -88,7 +88,7 @@ install -m755 dist/common/bin/scyllatop $RPM_BUILD_ROOT%{_bindir}
install -m755 scylla-blocktune $RPM_BUILD_ROOT%{_prefix}/lib/scylla/
install -m755 scylla-housekeeping $RPM_BUILD_ROOT%{_prefix}/lib/scylla/
if @@HOUSEKEEPING_CONF@@; then
install -m644 conf/housekeeping.cfg $RPM_BUILD_ROOT%{_sysconfdir}/scylla/
install -m644 conf/housekeeping.cfg $RPM_BUILD_ROOT%{_sysconfdir}/scylla.d/
fi
install -d -m755 $RPM_BUILD_ROOT%{_docdir}/scylla
install -m644 README.md $RPM_BUILD_ROOT%{_docdir}/scylla/
@@ -101,6 +101,7 @@ install -d -m755 $RPM_BUILD_ROOT%{_sharedstatedir}/scylla/
install -d -m755 $RPM_BUILD_ROOT%{_sharedstatedir}/scylla/data
install -d -m755 $RPM_BUILD_ROOT%{_sharedstatedir}/scylla/commitlog
install -d -m755 $RPM_BUILD_ROOT%{_sharedstatedir}/scylla/coredump
install -d -m755 $RPM_BUILD_ROOT%{_sharedstatedir}/scylla-housekeeping
install -d -m755 $RPM_BUILD_ROOT%{_prefix}/lib/scylla/swagger-ui
cp -r swagger-ui/dist $RPM_BUILD_ROOT%{_prefix}/lib/scylla/swagger-ui
install -d -m755 $RPM_BUILD_ROOT%{_prefix}/lib/scylla/api
@@ -110,8 +111,8 @@ cp -r scylla-housekeeping $RPM_BUILD_ROOT%{_prefix}/lib/scylla/scylla-housekeepi
cp -P dist/common/sbin/* $RPM_BUILD_ROOT%{_sbindir}/
%pre server
/usr/sbin/groupadd scylla 2> /dev/null || :
/usr/sbin/useradd -g scylla -s /sbin/nologin -r -d %{_sharedstatedir}/scylla scylla 2> /dev/null || :
getent group scylla || /usr/sbin/groupadd scylla 2> /dev/null || :
getent passwd scylla || /usr/sbin/useradd -g scylla -s /sbin/nologin -r -d %{_sharedstatedir}/scylla scylla 2> /dev/null || :
%post server
# Upgrade coredump settings
@@ -193,6 +194,7 @@ rm -rf $RPM_BUILD_ROOT
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla/data
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla/commitlog
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla/coredump
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla-housekeeping
%package conf
Group: Applications/Databases
@@ -216,7 +218,7 @@ mv /tmp/scylla.yaml /etc/scylla/scylla.yaml
%config(noreplace) %{_sysconfdir}/scylla/scylla.yaml
%config(noreplace) %{_sysconfdir}/scylla/cassandra-rackdc.properties
%if %is_housekeeping_conf
%config(noreplace) %{_sysconfdir}/scylla/housekeeping.cfg
%config(noreplace) %{_sysconfdir}/scylla.d/housekeeping.cfg
%endif


@@ -51,6 +51,9 @@ fi
if [ ! -f /usr/bin/wget ]; then
sudo apt-get -y install wget
fi
if [ ! -f /usr/bin/lsb_release ]; then
sudo apt-get -y install lsb-release
fi
DISTRIBUTION=`lsb_release -i|awk '{print $3}'`
CODENAME=`lsb_release -c|awk '{print $2}'`


@@ -16,7 +16,7 @@ Conflicts: scylla-server (<< 1.1)
Package: scylla-server
Architecture: amd64
Depends: ${shlibs:Depends}, ${misc:Depends}, adduser, hwloc-nox, collectd, scylla-conf, python-yaml, python-urwid, python-requests, curl, util-linux, realpath, python3-yaml, python3, uuid-runtime, @@DEPENDS@@
Depends: ${shlibs:Depends}, ${misc:Depends}, adduser, hwloc-nox, collectd, scylla-conf, python-yaml, python-urwid, python-requests, curl, util-linux, realpath, python3-yaml, python3, uuid-runtime, pciutils, @@DEPENDS@@
Description: Scylla database server binaries
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.


@@ -4,3 +4,4 @@ var/lib/scylla
var/lib/scylla/data
var/lib/scylla/commitlog
var/lib/scylla/coredump
var/lib/scylla-housekeeping


@@ -10,6 +10,7 @@ if [ "$1" = configure ]; then
--disabled-password \
--group scylla
chown -R scylla:scylla /var/lib/scylla
chown -R scylla:scylla /var/lib/scylla-housekeeping
fi
ln -sfT /etc/scylla /var/lib/scylla/conf


@@ -23,13 +23,16 @@ description "A timer job file for running scylla-housekeeping"
start on started scylla-server
stop on stopping scylla-server
setuid scylla
setgid scylla
script
# make sure scylla is up before checking for the version
sleep 5
/usr/lib/scylla/scylla-housekeeping --uuid-file /etc/scylla.d/housekeeping.uuid -c /etc/scylla.d/housekeeping.cfg -q version --mode r || true
/usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid -c /etc/scylla.d/housekeeping.cfg -q version --mode r || true
while [ 1 ]
do
sleep 1d
/usr/lib/scylla/scylla-housekeeping --uuid-file /etc/scylla.d/housekeeping.uuid -c /etc/scylla.d/housekeeping.cfg -q version --mode d || true
/usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid -c /etc/scylla.d/housekeeping.cfg -q version --mode d || true
done
end script


@@ -94,7 +94,7 @@ frozen_mutation::frozen_mutation(const mutation& m)
{
mutation_partition_serializer part_ser(*m.schema(), m.partition());
ser::writer_of_mutation wom(_bytes);
ser::writer_of_mutation<bytes_ostream> wom(_bytes);
std::move(wom).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())
@@ -157,7 +157,7 @@ stop_iteration streamed_mutation_freezer::consume(range_tombstone&& rt) {
frozen_mutation streamed_mutation_freezer::consume_end_of_stream() {
bytes_ostream out;
ser::writer_of_mutation wom(out);
ser::writer_of_mutation<bytes_ostream> wom(out);
std::move(wom).write_table_id(_schema.id())
.write_schema_version(_schema.version())
.write_key(_key)
@@ -192,7 +192,7 @@ class fragmenting_mutation_freezer {
private:
future<> flush() {
bytes_ostream out;
ser::writer_of_mutation wom(out);
ser::writer_of_mutation<bytes_ostream> wom(out);
std::move(wom).write_table_id(_schema.id())
.write_schema_version(_schema.version())
.write_key(_key)


@@ -34,7 +34,7 @@ frozen_schema::frozen_schema(const schema_ptr& s)
: _data([&s] {
schema_mutations sm = db::schema_tables::make_table_mutations(s, api::new_timestamp());
bytes_ostream out;
ser::writer_of_schema wr(out);
ser::writer_of_schema<bytes_ostream> wr(out);
std::move(wr).write_version(s->version())
.write_mutations(sm)
.end_schema();


@@ -303,10 +303,11 @@ def handle_visitors_state(info, hout, clases = []):
name = "__".join(clases) if clases else cls["name"]
frame = "empty_frame" if "final" in cls else "frame"
fprintln(hout, Template("""
template<typename Output>
struct state_of_$name {
$frame f;""").substitute({'name': name, 'frame': frame }))
$frame<Output> f;""").substitute({'name': name, 'frame': frame }))
if clases:
local_state = "state_of_" + "__".join(clases[:-1])
local_state = "state_of_" + "__".join(clases[:-1]) + '<Output>'
fprintln(hout, Template(" $name _parent;").substitute({'name': local_state}))
if "final" in cls:
fprintln(hout, Template(" state_of_$name($state parent) : _parent(parent) {}").substitute({'name': name, 'state' : local_state}))
@@ -327,18 +328,19 @@ def add_vector_node(hout, cls, members, base_state, current_node, ind):
current = members[ind]
typ = current["type"][1]
fprintln(hout, Template("""
template<typename Output>
struct $node_name {
bytes_ostream& _out;
state_of_$base_state _state;
place_holder _size;
Output& _out;
state_of_$base_state<Output> _state;
place_holder<Output> _size;
size_type _count = 0;
$node_name(bytes_ostream& out, state_of_$base_state state)
$node_name(Output& out, state_of_$base_state<Output> state)
: _out(out)
, _state(state)
, _size(start_place_holder(out))
{
}
$next_state end_$name() {
$next_state<Output> end_$name() {
_size.set(_out, _count);
}""").substitute({'node_name': '', 'name': current["name"] }))
@@ -363,7 +365,7 @@ def optional_add_methods(typ):
}""")).substitute({'type' : added_type})
if is_local_type(typ):
res = res + Template(reindent(4, """
writer_of_$type write() {
writer_of_$type<Output> write() {
serialize(_out, true);
return {_out};
}""")).substitute({'type' : param_type(typ)})
@@ -381,7 +383,7 @@ def vector_add_method(current, base_state):
}""").substitute({'type': param_type(typ[1][0]), 'name': current["name"]})
else:
res = res + Template("""
writer_of_$type add() {
writer_of_$type<Output> add() {
_count++;
return {_out};
}""").substitute({'type': flat_type(typ[1][0]), 'name': current["name"]})
@@ -391,7 +393,7 @@ def vector_add_method(current, base_state):
_count++;
}""").substitute({'type': param_view_type(typ[1][0])})
return res + Template("""
after_${basestate}__$name end_$name() && {
after_${basestate}__$name<Output> end_$name() && {
_size.set(_out, _count);
return { _out, std::move(_state) };
}
@@ -418,7 +420,7 @@ def add_param_writer_basic_type(name, base_state, typ, var_type = "", var_index
typ = 'const ' + typ + '&'
return Template(reindent(4, """
after_${base_state}__$name write_$name$var_type($typ t) && {
after_${base_state}__$name<Output> write_$name$var_type($typ t) && {
$set_varient_index
serialize(_out, t);
$set_command
@@ -431,7 +433,7 @@ def add_param_writer_object(name, base_state, typ, var_type = "", var_index = No
var_index = "uint32_t(" + str(var_index) +")"
set_varient_index = "serialize(_out, " + var_index +");\n" if var_index is not None else ""
ret = Template(reindent(4,"""
${base_state}__${name}$var_type1 start_${name}$var_type() && {
${base_state}__${name}$var_type1<Output> start_${name}$var_type() && {
$set_varient_index
return { _out, std::move(_state) };
}
@@ -443,9 +445,9 @@ def add_param_writer_object(name, base_state, typ, var_type = "", var_index = No
return_command = "{ _out, std::move(_state._parent) }" if var_type is not "" and not root_node else "{ _out, std::move(_state) }"
ret += Template(reindent(4, """
template<typename Serializer>
after_${base_state}__${name} ${name}$var_type(Serializer&& f) && {
after_${base_state}__${name}<Output> ${name}$var_type(Serializer&& f) && {
$set_varient_index
f(writer_of_$typ(_out));
f(writer_of_$typ<Output>(_out));
$set_command
return $return_command;
}""")).substitute(locals())
@@ -458,7 +460,7 @@ def add_param_write(current, base_state, vector = False, root_node = False):
res = res + add_param_writer_basic_type(current["name"], base_state, typ)
elif is_optional(typ):
res = res + Template(reindent(4, """
after_${basestate}__$name skip_$name() && {
after_${basestate}__$name<Output> skip_$name() && {
serialize(_out, false);
return { _out, std::move(_state) };
}""")).substitute({'type': param_type(typ), 'name': current["name"], 'basestate' : base_state})
@@ -472,11 +474,11 @@ def add_param_write(current, base_state, vector = False, root_node = False):
set_size = "_size.set(_out, 0);" if vector else "serialize(_out, size_type(0));"
res = res + Template("""
${basestate}__$name start_$name() && {
${basestate}__$name<Output> start_$name() && {
return { _out, std::move(_state) };
}
after_${basestate}__$name skip_$name() && {
after_${basestate}__$name<Output> skip_$name() && {
$set
return { _out, std::move(_state) };
}
@@ -506,7 +508,7 @@ def get_return_struct(variant_node, clases):
def add_variant_end_method(base_state, name, clases):
return_struct = "after_" + base_state
return_struct = "after_" + base_state + '<Output>'
return Template("""
$return_struct end_$name() && {
_state.f.end(_out);
@@ -520,7 +522,7 @@ def add_end_method(parents, name, variant_node = False, return_value = True):
return add_variant_end_method(parents, name, return_value)
base_state = parents + "__" + name
if return_value:
return_struct = "after_" + base_state
return_struct = "after_" + base_state + '<Output>'
return Template("""
$return_struct end_$name() && {
_state.f.end(_out);
@@ -534,7 +536,7 @@ def add_end_method(parents, name, variant_node = False, return_value = True):
""").substitute({'name': name, 'basestate':base_state})
def add_vector_placeholder():
return """ place_holder _size;
return """ place_holder<Output> _size;
size_type _count = 0;"""
def add_node(hout, name, member, base_state, prefix, parents, fun, is_type_vector = False, is_type_final = False):
@@ -554,21 +556,22 @@ def add_node(hout, name, member, base_state, prefix, parents, fun, is_type_vecto
else:
state_init = ""
if prefix == "writer_of_":
constructor = Template("""$name(bytes_ostream& out)
constructor = Template("""$name(Output& out)
: _out(out)
, _state{start_frame(out)}${vector_init}
{}""").substitute({'name': struct_name, 'vector_init' : vector_init})
elif state_init != "":
constructor = Template("""$name(bytes_ostream& out, state_of_$state state)
constructor = Template("""$name(Output& out, state_of_$state<Output> state)
: _out(out)
, $state_init${vector_init}
{}""").substitute({'name': struct_name, 'vector_init' : vector_init, 'state' : parents, 'state_init' : state_init})
else:
constructor = ""
fprintln(hout, Template("""
template<typename Output>
struct $name {
bytes_ostream& _out;
state_of_$state _state;
Output& _out;
state_of_$state<Output> _state;
${vector_placeholder}
${constructor}
$fun
@@ -588,8 +591,9 @@ def add_optional_node(hout, typ):
return
optional_nodes.add(full_type)
fprintln(hout, Template(reindent(0,"""
template<typename Output>
struct writer_of_$type {
bytes_ostream& _out;
Output& _out;
$add_method
};""")).substitute({'type': full_type, 'add_method': optional_add_methods(typ[1][0])}))
@@ -604,7 +608,7 @@ def add_variant_nodes(hout, member, param, base_state, parents, classes):
new_member = {"type": typ, "name" : "variant"}
return_struct = "after_" + par
end_method = Template("""
$return_struct end_variant() && {
$return_struct<Output> end_variant() && {
_state.f.end(_out);
return { _out, std::move(_state._parent) };
}


@@ -27,5 +27,5 @@ class partition {
class reconcilable_result {
uint32_t row_count();
std::vector<partition> partitions();
query::short_read is_short_read() [[version 1.7]] = query::short_read::no;
query::short_read is_short_read() [[version 1.6]] = query::short_read::no;
};


@@ -29,7 +29,7 @@ class result {
bytes buf();
std::experimental::optional<query::result_digest> digest();
api::timestamp_type last_modified() [ [version 1.2] ] = api::missing_timestamp;
query::short_read is_short_read() [[version 1.7]] = query::short_read::no;
query::short_read is_short_read() [[version 1.6]] = query::short_read::no;
};
}

main.cc

@@ -426,6 +426,8 @@ int main(int ac, char** av) {
if (opts.count("developer-mode")) {
smp::invoke_on_all([] { engine().set_strict_dma(false); }).get();
}
supervisor_notify("creating tracing");
tracing::tracing::create_tracing("trace_keyspace_helper").get();
supervisor_notify("creating snitch");
i_endpoint_snitch::create_snitch(cfg->endpoint_snitch()).get();
// #293 - do not stop anything
@@ -580,7 +582,12 @@ int main(int ac, char** av) {
// we will have races between the compaction and loading processes
// We also want to trigger regular compaction on boot.
db.invoke_on_all([&proxy] (database& db) {
for (auto& x : db.get_column_families()) {
// avoid excessive disk usage by making sure all shards reshard
// shared sstables in the same order. That's done by choosing
// column families in UUID order, and each individual column
// family will reshard shared sstables in generation order.
auto cfs = boost::copy_range<std::map<utils::UUID, lw_shared_ptr<column_family>>>(db.get_column_families());
for (auto& x : cfs) {
column_family& cf = *(x.second);
// We start the rewrite, but do not wait for it.
cf.start_rewrite();
@@ -632,7 +639,7 @@ int main(int ac, char** av) {
gms::get_local_gossiper().wait_for_gossip_to_settle().get();
api::set_server_gossip_settle(ctx).get();
supervisor_notify("starting tracing");
tracing::tracing::create_tracing("trace_keyspace_helper").get();
tracing::tracing::start_tracing().get();
supervisor_notify("starting native transport");
service::get_local_storage_service().start_native_transport().get();
if (start_thrift) {
@@ -659,6 +666,7 @@ int main(int ac, char** av) {
uint16_t pport = cfg->prometheus_port();
if (pport) {
pctx.metric_help = "Scylla server statistics";
pctx.prefix = cfg->prometheus_prefix();
prometheus_server.start().get();
prometheus::start(prometheus_server, pctx);
prometheus_server.listen(ipv4_addr{prom_addr.addresses[0].in.s_addr, pport}).get();


@@ -296,9 +296,10 @@ messaging_service::messaging_service(gms::inet_address ip
_rpc->set_logger([] (const sstring& log) {
rpc_logger.info("{}", log);
});
register_handler(this, messaging_verb::CLIENT_ID, [] (rpc::client_info& ci, gms::inet_address broadcast_address, uint32_t src_cpu_id) {
register_handler(this, messaging_verb::CLIENT_ID, [] (rpc::client_info& ci, gms::inet_address broadcast_address, uint32_t src_cpu_id, rpc::optional<uint64_t> max_result_size) {
ci.attach_auxiliary("baddr", broadcast_address);
ci.attach_auxiliary("src_cpu_id", src_cpu_id);
ci.attach_auxiliary("max_result_size", max_result_size.value_or(query::result_memory_limiter::maximum_result_size));
return rpc::no_wait;
});
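The CLIENT_ID handler above gains a trailing `rpc::optional<uint64_t>` so that messages from peers running the older protocol, which omit the field, still decode; the server falls back to a legacy maximum via `value_or`. A small sketch of that compatibility pattern using `std::optional` (the constant here is illustrative, not Scylla's actual maximum):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Sketch: a newly added, optional wire parameter with a conservative
// default for peers that predate it.
constexpr uint64_t legacy_maximum_result_size = 1 << 20; // hypothetical value

uint64_t effective_max_result_size(std::optional<uint64_t> advertised) {
    // Older client: nothing on the wire -> assume the pre-upgrade maximum.
    return advertised.value_or(legacy_maximum_result_size);
}
```

Making new parameters optional and trailing is what lets mixed-version clusters keep exchanging CLIENT_ID messages during a rolling upgrade.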
@@ -385,6 +386,8 @@ static unsigned get_rpc_client_idx(messaging_verb verb) {
verb == messaging_verb::STREAM_MUTATION_DONE ||
verb == messaging_verb::COMPLETE_MESSAGE) {
idx = 2;
} else if (verb == messaging_verb::MUTATION_DONE) {
idx = 3;
}
return idx;
}
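The hunk above adds a fourth connection slot so that MUTATION_DONE traffic gets its own RPC client instead of sharing one with bulk verbs (which is why `_clients` grows from 3 to 4 below). A reduced sketch of the verb-to-slot mapping, with an illustrative subset of verbs:

```cpp
#include <cassert>

// Sketch: each verb class gets a dedicated connection index so that
// latency-sensitive acknowledgements are not queued behind bulk traffic.
enum class verb { mutation, mutation_done, streaming, gossip, other };

unsigned rpc_client_idx(verb v) {
    switch (v) {
    case verb::gossip:        return 1; // cluster membership traffic
    case verb::streaming:     return 2; // bulk data movement
    case verb::mutation_done: return 3; // the newly dedicated slot
    default:                  return 0; // everything else shares slot 0
    }
}
```

The array size in the class (`std::array<clients_map, 4>`) must match the largest index this function can return.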
@@ -499,7 +502,8 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
it = _clients[idx].emplace(id, shard_info(std::move(client))).first;
uint32_t src_cpu_id = engine().cpu_id();
_rpc->make_client<rpc::no_wait_type(gms::inet_address, uint32_t)>(messaging_verb::CLIENT_ID)(*it->second.rpc_client, utils::fb_utilities::get_broadcast_address(), src_cpu_id);
_rpc->make_client<rpc::no_wait_type(gms::inet_address, uint32_t, uint64_t)>(messaging_verb::CLIENT_ID)(*it->second.rpc_client, utils::fb_utilities::get_broadcast_address(), src_cpu_id,
query::result_memory_limiter::maximum_result_size);
return it->second.rpc_client;
}


@@ -196,7 +196,7 @@ private:
std::array<std::unique_ptr<rpc_protocol_server_wrapper>, 2> _server;
::shared_ptr<seastar::tls::server_credentials> _credentials;
std::array<std::unique_ptr<rpc_protocol_server_wrapper>, 2> _server_tls;
std::array<clients_map, 3> _clients;
std::array<clients_map, 4> _clients;
uint64_t _dropped_messages[static_cast<int32_t>(messaging_verb::LAST)] = {};
bool _stopping = false;
public:


@@ -256,9 +256,7 @@ mutation_partition::mutation_partition(const mutation_partition& x, const schema
try {
for(auto&& r : ck_ranges) {
for (const rows_entry& e : x.range(schema, r)) {
std::unique_ptr<rows_entry> copy(current_allocator().construct<rows_entry>(e));
_rows.insert(_rows.end(), *copy);
copy.release();
_rows.push_back(*current_allocator().construct<rows_entry>(e));
}
}
} catch (...) {
@@ -578,7 +576,7 @@ void mutation_partition::for_each_row(const schema& schema, const query::cluster
template<typename RowWriter>
void write_cell(RowWriter& w, const query::partition_slice& slice, ::atomic_cell_view c) {
assert(c.is_live());
ser::writer_of_qr_cell wr = w.add().write();
auto wr = w.add().write();
auto after_timestamp = [&, wr = std::move(wr)] () mutable {
if (slice.options.contains<query::partition_slice::option::send_timestamp>()) {
return std::move(wr).write_timestamp(c.timestamp());
@@ -769,13 +767,14 @@ mutation_partition::query_compacted(query::result::partition_writer& pw, const s
// If ck:s exist, and we do a restriction on them, we either have matching
// rows, or return nothing, since cql does not allow "is null".
if (row_count == 0
&& (has_ck_selector(pw.ranges())
|| !has_any_live_data(s, column_kind::static_column, static_row()))) {
pw.retract();
} else {
pw.row_count() += row_count ? : 1;
&& (has_ck_selector(pw.ranges())
|| !has_any_live_data(s, column_kind::static_column, static_row()))) {
pw.retract();
} else {
pw.row_count() += row_count ? : 1;
pw.partition_count() += 1;
std::move(rows_wr).end_rows().end_qr_partition();
}
}
}
std::ostream&
@@ -1670,10 +1669,11 @@ class mutation_querier {
const schema& _schema;
query::result_memory_accounter& _memory_accounter;
query::result::partition_writer& _pw;
ser::qr_partition__static_row__cells _static_cells_wr;
ser::qr_partition__static_row__cells<bytes_ostream> _static_cells_wr;
bool _live_data_in_static_row{};
uint32_t _live_clustering_rows = 0;
stdx::optional<ser::qr_partition__rows> _rows_wr;
stdx::optional<ser::qr_partition__rows<bytes_ostream>> _rows_wr;
bool _short_reads_allowed;
private:
void query_static_row(const row& r, tombstone current_tombstone);
void prepare_writers();
@@ -1695,6 +1695,7 @@ mutation_querier::mutation_querier(const schema& s, query::result::partition_wri
, _memory_accounter(memory_accounter)
, _pw(pw)
, _static_cells_wr(pw.start().start_static_row().start_cells())
, _short_reads_allowed(pw.slice().options.contains<query::partition_slice::option::allow_short_read>())
{
}
@@ -1706,7 +1707,13 @@ void mutation_querier::query_static_row(const row& r, tombstone current_tombston
auto start = _static_cells_wr._out.size();
get_compacted_row_slice(_schema, slice, column_kind::static_column,
r, slice.static_columns, _static_cells_wr);
_memory_accounter.update_and_check(_static_cells_wr._out.size() - start);
_memory_accounter.update(_static_cells_wr._out.size() - start);
} else if (_short_reads_allowed) {
seastar::measuring_output_stream stream;
ser::qr_partition__static_row__cells<seastar::measuring_output_stream> out(stream, { });
get_compacted_row_slice(_schema, slice, column_kind::static_column,
r, slice.static_columns, out);
_memory_accounter.update(stream.size());
}
if (_pw.requested_digest()) {
::feed_hash(_pw.digest(), current_tombstone);
@@ -1744,23 +1751,32 @@ stop_iteration mutation_querier::consume(clustering_row&& cr, tombstone current_
_pw.last_modified() = std::max({_pw.last_modified(), current_tombstone.timestamp, t});
}
auto stop = stop_iteration::no;
if (_pw.requested_result()) {
auto start = _rows_wr->_out.size();
auto write_row = [&] (auto& rows_writer) {
auto cells_wr = [&] {
if (slice.options.contains(query::partition_slice::option::send_clustering_key)) {
return _rows_wr->add().write_key(cr.key()).start_cells().start_cells();
return rows_writer.add().write_key(cr.key()).start_cells().start_cells();
} else {
return _rows_wr->add().skip_key().start_cells().start_cells();
return rows_writer.add().skip_key().start_cells().start_cells();
}
}();
get_compacted_row_slice(_schema, slice, column_kind::regular_column, cr.cells(), slice.regular_columns, cells_wr);
std::move(cells_wr).end_cells().end_cells().end_qr_clustered_row();
};
auto stop = stop_iteration::no;
if (_pw.requested_result()) {
auto start = _rows_wr->_out.size();
write_row(*_rows_wr);
stop = _memory_accounter.update_and_check(_rows_wr->_out.size() - start);
} else if (_short_reads_allowed) {
seastar::measuring_output_stream stream;
ser::qr_partition__rows<seastar::measuring_output_stream> out(stream, { });
write_row(out);
stop = _memory_accounter.update_and_check(stream.size());
}
_live_clustering_rows++;
return stop;
return stop && stop_iteration(_short_reads_allowed);
}
uint32_t mutation_querier::consume_end_of_stream() {
@@ -1776,15 +1792,16 @@ uint32_t mutation_querier::consume_end_of_stream() {
_pw.retract();
return 0;
} else {
auto live_rows = std::max(_live_clustering_rows, uint32_t(1));
_pw.row_count() += live_rows;
_pw.partition_count() += 1;
std::move(*_rows_wr).end_rows().end_qr_partition();
return std::max(_live_clustering_rows, uint32_t(1));
return live_rows;
}
}
class query_result_builder {
const schema& _schema;
uint32_t _live_rows = 0;
uint32_t _partitions = 0;
query::result::builder& _rb;
stdx::optional<query::result::partition_writer> _pw;
stdx::optional<mutation_querier> _mutation_consumer;
@@ -1819,9 +1836,7 @@ public:
stop_iteration consume_end_of_partition() {
auto live_rows_in_partition = _mutation_consumer->consume_end_of_stream();
_live_rows += live_rows_in_partition;
_partitions += live_rows_in_partition > 0;
if (live_rows_in_partition && !_stop) {
if (_short_read_allowed && live_rows_in_partition > 0 && !_stop) {
_stop = _rb.memory_accounter().check();
}
if (_stop) {
@@ -1830,17 +1845,17 @@ public:
return _stop;
}
data_query_result consume_end_of_stream() {
return {_live_rows, _partitions};
void consume_end_of_stream() {
}
};
future<data_query_result> data_query(schema_ptr s, const mutation_source& source, const query::partition_range& range,
const query::partition_slice& slice, uint32_t row_limit, uint32_t partition_limit,
gc_clock::time_point query_time, query::result::builder& builder)
future<> data_query(
schema_ptr s, const mutation_source& source, const query::partition_range& range,
const query::partition_slice& slice, uint32_t row_limit, uint32_t partition_limit,
gc_clock::time_point query_time, query::result::builder& builder)
{
if (row_limit == 0 || slice.partition_row_limit() == 0 || partition_limit == 0) {
return make_ready_future<data_query_result>();
return make_ready_future<>();
}
auto is_reversed = slice.options.contains(query::partition_slice::option::reversed);
@@ -1918,7 +1933,7 @@ public:
// well. Next page fetch will ask for the next partition and if we
// don't do that we could end up with an unbounded number of
// partitions with only a static row.
_stop = _stop || _memory_accounter.check();
_stop = _stop || (_memory_accounter.check() && stop_iteration(_short_read_allowed));
}
_total_live_rows += _live_rows;
_result.emplace_back(partition { _live_rows, _mutation_consumer->consume_end_of_stream() });
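The memory-accounting pattern running through these hunks can be sketched independently of Seastar. Below is a minimal model (hypothetical `toy_accounter`, not Scylla's actual `query::result_memory_accounter`) of `update_and_check()` signalling a stop once the per-result budget is exceeded:

```cpp
#include <cstddef>

// Toy model (assumed names) of the accounting pattern above: every
// serialized row's size is fed to update_and_check(), which returns true
// ("stop_iteration::yes") once the accumulated size exceeds the budget.
struct toy_accounter {
    size_t used = 0;
    size_t max_size;
    explicit toy_accounter(size_t m) : max_size(m) {}
    bool update_and_check(size_t n) {
        used += n;
        return used > max_size;   // short-read cut-off point
    }
    bool check() const { return used > max_size; }
};
```

This is why `consume()` above returns `stop && stop_iteration(_short_reads_allowed)`: the accounter may request a stop, but a short read is only honored when the slice carries the `allow_short_read` option.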


@@ -189,17 +189,17 @@ mutation_partition_serializer::mutation_partition_serializer(const schema& schem
void
mutation_partition_serializer::write(bytes_ostream& out) const {
write(ser::writer_of_mutation_partition(out));
write(ser::writer_of_mutation_partition<bytes_ostream>(out));
}
void mutation_partition_serializer::write(ser::writer_of_mutation_partition&& wr) const
void mutation_partition_serializer::write(ser::writer_of_mutation_partition<bytes_ostream>&& wr) const
{
write_serialized(std::move(wr), _schema, _p);
}
void serialize_mutation_fragments(const schema& s, tombstone partition_tombstone,
stdx::optional<static_row> sr, range_tombstone_list rts,
std::deque<clustering_row> crs, ser::writer_of_mutation_partition&& wr)
std::deque<clustering_row> crs, ser::writer_of_mutation_partition<bytes_ostream>&& wr)
{
auto srow_writer = std::move(wr).write_tomb(partition_tombstone).start_static_row();
auto row_tombstones = [&] {


@@ -29,6 +29,7 @@
#include "streamed_mutation.hh"
namespace ser {
template<typename Output>
class writer_of_mutation_partition;
}
@@ -47,9 +48,9 @@ public:
mutation_partition_serializer(const schema&, const mutation_partition&);
public:
void write(bytes_ostream&) const;
void write(ser::writer_of_mutation_partition&&) const;
void write(ser::writer_of_mutation_partition<bytes_ostream>&&) const;
};
void serialize_mutation_fragments(const schema& s, tombstone partition_tombstone,
stdx::optional<static_row> sr, range_tombstone_list range_tombstones,
std::deque<clustering_row> clustering_rows, ser::writer_of_mutation_partition&&);
std::deque<clustering_row> clustering_rows, ser::writer_of_mutation_partition<bytes_ostream>&&);


@@ -57,13 +57,14 @@ bool reconcilable_result::operator!=(const reconcilable_result& other) const {
}
query::result
to_data_query_result(const reconcilable_result& r, schema_ptr s, const query::partition_slice& slice, uint32_t max_partitions) {
to_data_query_result(const reconcilable_result& r, schema_ptr s, const query::partition_slice& slice, uint32_t max_rows, uint32_t max_partitions) {
query::result::builder builder(slice, query::result_request::only_result, { });
for (const partition& p : r.partitions()) {
if (!max_partitions--) {
if (builder.row_count() >= max_rows || builder.partition_count() >= max_partitions) {
break;
}
p.mut().unfreeze(s).query(builder, slice, gc_clock::time_point::min(), query::max_rows);
// Also enforces the per-partition limit.
p.mut().unfreeze(s).query(builder, slice, gc_clock::time_point::min(), max_rows - builder.row_count());
}
if (r.is_short_read()) {
builder.mark_as_short_read();
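The reworked limit check in `to_data_query_result()` can be modelled with plain counters. A sketch (hypothetical `toy_builder`/`replay`, assuming partitions are described only by their row counts) of stopping on whichever of the row or partition limit is hit first:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal model of a result builder's counters (illustrative, not the
// real query::result::builder).
struct toy_builder { uint32_t rows = 0, partitions = 0; };

// Replays partitions under both limits, mirroring the
// `builder.row_count() >= max_rows || builder.partition_count() >= max_partitions`
// check in the hunk above; each partition also respects the remaining
// row budget (`max_rows - builder.row_count()`).
toy_builder replay(const std::vector<uint32_t>& partition_rows,
                   uint32_t max_rows, uint32_t max_partitions) {
    toy_builder b;
    for (uint32_t rows : partition_rows) {
        if (b.rows >= max_rows || b.partitions >= max_partitions) {
            break;
        }
        b.rows += std::min(rows, max_rows - b.rows);
        b.partitions += 1;
    }
    return b;
}
```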


@@ -105,7 +105,7 @@ public:
printer pretty_printer(schema_ptr) const;
};
query::result to_data_query_result(const reconcilable_result&, schema_ptr, const query::partition_slice&, uint32_t partition_limit = query::max_partitions);
query::result to_data_query_result(const reconcilable_result&, schema_ptr, const query::partition_slice&, uint32_t row_limit, uint32_t partition_limit);
// Performs a query on given data source returning data in reconcilable form.
//
@@ -128,11 +128,7 @@ future<reconcilable_result> mutation_query(
gc_clock::time_point query_time,
query::result_memory_accounter&& accounter = { });
struct data_query_result {
uint32_t live_rows{0};
uint32_t partitions{0};
};
future<data_query_result> data_query(schema_ptr s, const mutation_source& source, const query::partition_range& range,
const query::partition_slice& slice, uint32_t row_limit, uint32_t partition_limit,
gc_clock::time_point query_time, query::result::builder& builder);
future<> data_query(
schema_ptr s, const mutation_source& source, const query::partition_range& range,
const query::partition_slice& slice, uint32_t row_limit, uint32_t partition_limit,
gc_clock::time_point query_time, query::result::builder& builder);


@@ -40,29 +40,31 @@ namespace query {
class result::partition_writer {
result_request _request;
ser::after_qr_partition__key _w;
ser::after_qr_partition__key<bytes_ostream> _w;
const partition_slice& _slice;
// We are tasked with keeping track of the range
// as well, since we are the primary "context"
// when iterating "inside" a partition
const clustering_row_ranges& _ranges;
ser::query_result__partitions& _pw;
ser::query_result__partitions<bytes_ostream>& _pw;
ser::vector_position _pos;
bool _static_row_added = false;
md5_hasher& _digest;
md5_hasher _digest_pos;
uint32_t& _row_count;
uint32_t& _partition_count;
api::timestamp_type& _last_modified;
public:
partition_writer(
result_request request,
const partition_slice& slice,
const clustering_row_ranges& ranges,
ser::query_result__partitions& pw,
ser::query_result__partitions<bytes_ostream>& pw,
ser::vector_position pos,
ser::after_qr_partition__key w,
ser::after_qr_partition__key<bytes_ostream> w,
md5_hasher& digest,
uint32_t& row_count,
uint32_t& partition_count,
api::timestamp_type& last_modified)
: _request(request)
, _w(std::move(w))
@@ -73,6 +75,7 @@ public:
, _digest(digest)
, _digest_pos(digest)
, _row_count(row_count)
, _partition_count(partition_count)
, _last_modified(last_modified)
{ }
@@ -84,7 +87,7 @@ public:
return _request != result_request::only_digest;
}
ser::after_qr_partition__key start() {
ser::after_qr_partition__key<bytes_ostream> start() {
return std::move(_w);
}
@@ -108,6 +111,9 @@ public:
uint32_t& row_count() {
return _row_count;
}
uint32_t& partition_count() {
return _partition_count;
}
api::timestamp_type& last_modified() {
return _last_modified;
}
@@ -118,16 +124,17 @@ class result::builder {
bytes_ostream _out;
md5_hasher _digest;
const partition_slice& _slice;
ser::query_result__partitions _w;
ser::query_result__partitions<bytes_ostream> _w;
result_request _request;
uint32_t _row_count = 0;
uint32_t _partition_count = 0;
api::timestamp_type _last_modified = api::missing_timestamp;
short_read _short_read;
result_memory_accounter _memory_accounter;
public:
builder(const partition_slice& slice, result_request request, result_memory_accounter memory_accounter)
: _slice(slice)
, _w(ser::writer_of_query_result(_out).start_partitions())
, _w(ser::writer_of_query_result<bytes_ostream>(_out).start_partitions())
, _request(request)
, _memory_accounter(std::move(memory_accounter))
{ }
@@ -140,6 +147,14 @@ public:
const partition_slice& slice() const { return _slice; }
uint32_t row_count() const {
return _row_count;
}
uint32_t partition_count() const {
return _partition_count;
}
// Starts new partition and returns a builder for its contents.
// Invalidates all previously obtained builders
partition_writer add_partition(const schema& s, const partition_key& key) {
@@ -156,22 +171,23 @@ public:
if (_request != result_request::only_result) {
key.feed_hash(_digest, s);
}
return partition_writer(_request, _slice, ranges, _w, std::move(pos), std::move(after_key), _digest, _row_count, _last_modified);
return partition_writer(_request, _slice, ranges, _w, std::move(pos), std::move(after_key), _digest, _row_count,
_partition_count, _last_modified);
}
result build() {
std::move(_w).end_partitions().end_query_result();
switch (_request) {
case result_request::only_result:
return result(std::move(_out), _short_read, _row_count, std::move(_memory_accounter).done());
return result(std::move(_out), _short_read, _row_count, _partition_count, std::move(_memory_accounter).done());
case result_request::only_digest: {
bytes_ostream buf;
ser::writer_of_query_result(buf).start_partitions().end_partitions().end_query_result();
ser::writer_of_query_result<bytes_ostream>(buf).start_partitions().end_partitions().end_query_result();
return result(std::move(buf), result_digest(_digest.finalize_array()), _last_modified, _short_read);
}
case result_request::result_and_digest:
return result(std::move(_out), result_digest(_digest.finalize_array()),
_last_modified, _short_read, _row_count, std::move(_memory_accounter).done());
_last_modified, _short_read, _row_count, _partition_count, std::move(_memory_accounter).done());
}
abort();
}


@@ -67,8 +67,21 @@ public:
return _maximum_total_result_memory - _memory_limiter.available_units();
}
// Reserves minimum_result_size and creates new memory accounter.
future<result_memory_accounter> new_read();
// Reserves minimum_result_size and creates a new memory accounter for a
// mutation query. Uses the specified maximum result size; the read may be
// stopped before reaching it due to memory pressure on the shard.
future<result_memory_accounter> new_mutation_read(size_t max_result_size);
// Reserves minimum_result_size and creates a new memory accounter for a
// data query. Uses the specified maximum result size; the result will *not*
// be stopped due to on-shard memory pressure, in order to avoid digest
// mismatches.
future<result_memory_accounter> new_data_read(size_t max_result_size);
// Creates a memory accounter for digest reads. Such an accounter doesn't
// contribute to the shard memory usage, but still stops producing the
// result after the individual limit has been reached.
future<result_memory_accounter> new_digest_read(size_t max_result_size);
// Checks whether the result can grow any more, takes into account only
// the per shard limit.
@@ -108,12 +121,50 @@ class result_memory_accounter {
size_t _blocked_bytes = 0;
size_t _used_memory = 0;
size_t _total_used_memory = 0;
size_t _maximum_result_size = 0;
stop_iteration _stop_on_global_limit;
private:
explicit result_memory_accounter(result_memory_limiter& limiter) noexcept
// Mutation query accounter. Uses the provided individual result size limit
// and will stop when shard memory pressure grows too high.
struct mutation_query_tag { };
explicit result_memory_accounter(mutation_query_tag, result_memory_limiter& limiter, size_t max_size) noexcept
: _limiter(&limiter)
, _blocked_bytes(result_memory_limiter::minimum_result_size)
, _maximum_result_size(max_size)
, _stop_on_global_limit(true)
{ }
// Data query accounter. Uses the provided individual result size limit and
// will *not* stop even if shard memory pressure grows too high.
struct data_query_tag { };
explicit result_memory_accounter(data_query_tag, result_memory_limiter& limiter, size_t max_size) noexcept
: _limiter(&limiter)
, _blocked_bytes(result_memory_limiter::minimum_result_size)
, _maximum_result_size(max_size)
{ }
// Digest query accounter. Uses the provided individual result size limit and
// will *not* stop even if shard memory pressure grows too high. This
// accounter does not contribute to the shard memory limits.
struct digest_query_tag { };
explicit result_memory_accounter(digest_query_tag, result_memory_limiter&, size_t max_size) noexcept
: _blocked_bytes(0)
, _maximum_result_size(max_size)
{ }
friend class result_memory_limiter;
public:
// State of an accounter on another shard. Used to pass information about
// the size of the result so far in range queries.
class foreign_state {
size_t _used_memory;
size_t _max_result_size;
public:
foreign_state(size_t used_mem, size_t max_result_size)
: _used_memory(used_mem), _max_result_size(max_result_size) { }
size_t used_memory() const { return _used_memory; }
size_t max_result_size() const { return _max_result_size; }
};
public:
result_memory_accounter() = default;
@@ -123,9 +174,10 @@ public:
// accounter will learn how big the total result already is and limit the
// part produced on this shard so that after merging the final result
// does not exceed the individual limit.
result_memory_accounter(result_memory_limiter& limiter, const result_memory_accounter& foreign_accounter) noexcept
result_memory_accounter(result_memory_limiter& limiter, foreign_state fstate) noexcept
: _limiter(&limiter)
, _total_used_memory(foreign_accounter.used_memory())
, _total_used_memory(fstate.used_memory())
, _maximum_result_size(fstate.max_result_size())
{ }
result_memory_accounter(result_memory_accounter&& other) noexcept
@@ -133,6 +185,8 @@ public:
, _blocked_bytes(other._blocked_bytes)
, _used_memory(other._used_memory)
, _total_used_memory(other._total_used_memory)
, _maximum_result_size(other._maximum_result_size)
, _stop_on_global_limit(other._stop_on_global_limit)
{ }
result_memory_accounter& operator=(result_memory_accounter&& other) noexcept {
@@ -151,17 +205,21 @@ public:
size_t used_memory() const { return _used_memory; }
foreign_state state_for_another_shard() {
return foreign_state(_used_memory, _maximum_result_size);
}
// Consume n more bytes for the result. Returns stop_iteration::yes if
// the result cannot grow any more (taking into account both individual
// and per-shard limits).
stop_iteration update_and_check(size_t n) {
_used_memory += n;
_total_used_memory += n;
auto stop = stop_iteration(_total_used_memory > result_memory_limiter::maximum_result_size);
auto stop = stop_iteration(_total_used_memory > _maximum_result_size);
if (_limiter && _used_memory > _blocked_bytes) {
auto to_block = std::min(_used_memory - _blocked_bytes, n);
_blocked_bytes += to_block;
stop = _limiter->update_and_check(to_block) || stop;
stop = (_limiter->update_and_check(to_block) && _stop_on_global_limit) || stop;
}
return stop;
}
@@ -170,7 +228,7 @@ public:
stop_iteration check() const {
stop_iteration stop { _total_used_memory > result_memory_limiter::maximum_result_size };
if (!stop && _used_memory >= _blocked_bytes && _limiter) {
return _limiter->check();
return _limiter->check() && _stop_on_global_limit;
}
return stop;
}
@@ -189,12 +247,22 @@ public:
}
};
inline future<result_memory_accounter> result_memory_limiter::new_read() {
return _memory_limiter.wait(minimum_result_size).then([this] {
return result_memory_accounter(*this);
inline future<result_memory_accounter> result_memory_limiter::new_mutation_read(size_t max_size) {
return _memory_limiter.wait(minimum_result_size).then([this, max_size] {
return result_memory_accounter(result_memory_accounter::mutation_query_tag(), *this, max_size);
});
}
inline future<result_memory_accounter> result_memory_limiter::new_data_read(size_t max_size) {
return _memory_limiter.wait(minimum_result_size).then([this, max_size] {
return result_memory_accounter(result_memory_accounter::data_query_tag(), *this, max_size);
});
}
inline future<result_memory_accounter> result_memory_limiter::new_digest_read(size_t max_size) {
return make_ready_future<result_memory_accounter>(result_memory_accounter(result_memory_accounter::digest_query_tag(), *this, max_size));
}
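The three `new_*_read()` factories above rely on tag dispatch: otherwise-ambiguous constructors taking the same argument types are distinguished by empty tag structs. A minimal sketch of the idiom (simplified, hypothetical `toy_policy`; the real accounter tracks more state):

```cpp
// Tag dispatch: empty structs select which accounting policy the
// constructor installs, mirroring mutation_query_tag / data_query_tag /
// digest_query_tag in the diff above.
class toy_policy {
    bool _stop_on_global_limit;
    bool _contributes_to_shard_usage;
public:
    struct mutation_query_tag { };
    struct data_query_tag { };
    struct digest_query_tag { };
    // Mutation reads both count against shard memory and stop on it.
    explicit toy_policy(mutation_query_tag)
        : _stop_on_global_limit(true), _contributes_to_shard_usage(true) { }
    // Data reads count against shard memory but never stop on it.
    explicit toy_policy(data_query_tag)
        : _stop_on_global_limit(false), _contributes_to_shard_usage(true) { }
    // Digest reads neither count nor stop.
    explicit toy_policy(digest_query_tag)
        : _stop_on_global_limit(false), _contributes_to_shard_usage(false) { }
    bool stops_on_global_limit() const { return _stop_on_global_limit; }
    bool contributes() const { return _contributes_to_shard_usage; }
};
```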
enum class result_request {
only_result,
only_digest,
@@ -265,29 +333,32 @@ class result {
api::timestamp_type _last_modified = api::missing_timestamp;
short_read _short_read;
query::result_memory_tracker _memory_tracker;
stdx::optional<uint32_t> _partition_count;
public:
class builder;
class partition_writer;
friend class result_merger;
result();
result(bytes_ostream&& w, short_read sr, stdx::optional<uint32_t> c = { },
result(bytes_ostream&& w, short_read sr, stdx::optional<uint32_t> c = { }, stdx::optional<uint32_t> pc = { },
result_memory_tracker memory_tracker = { })
: _w(std::move(w))
, _row_count(c)
, _short_read(sr)
, _memory_tracker(std::move(memory_tracker))
, _partition_count(pc)
{
w.reduce_chunk_count();
}
result(bytes_ostream&& w, stdx::optional<result_digest> d, api::timestamp_type last_modified,
short_read sr, stdx::optional<uint32_t> c = { }, result_memory_tracker memory_tracker = { })
short_read sr, stdx::optional<uint32_t> c = { }, stdx::optional<uint32_t> pc = { }, result_memory_tracker memory_tracker = { })
: _w(std::move(w))
, _digest(d)
, _row_count(c)
, _last_modified(last_modified)
, _short_read(sr)
, _memory_tracker(std::move(memory_tracker))
, _partition_count(pc)
{
w.reduce_chunk_count();
}
@@ -316,7 +387,11 @@ public:
return _short_read;
}
uint32_t calculate_row_count(const query::partition_slice&);
const stdx::optional<uint32_t>& partition_count() const {
return _partition_count;
}
void calculate_counts(const query::partition_slice&);
struct printer {
schema_ptr s;


@@ -32,6 +32,9 @@
namespace query {
constexpr size_t result_memory_limiter::minimum_result_size;
constexpr size_t result_memory_limiter::maximum_result_size;
thread_local semaphore result_memory_tracker::_dummy { 0 };
const partition_range full_partition_range = partition_range::make_open_ended_both_sides();
@@ -161,16 +164,18 @@ std::ostream& operator<<(std::ostream& os, const query::result::printer& p) {
return os;
}
uint32_t result::calculate_row_count(const query::partition_slice& slice) {
void result::calculate_counts(const query::partition_slice& slice) {
struct {
uint32_t total_count = 0;
uint32_t current_partition_count = 0;
uint32_t live_partitions = 0;
void accept_new_partition(const partition_key& key, uint32_t row_count) {
accept_new_partition(row_count);
}
void accept_new_partition(uint32_t row_count) {
total_count += row_count;
current_partition_count = row_count;
live_partitions += 1;
}
void accept_new_row(const clustering_key& key, const result_row_view& static_row, const result_row_view& row) {}
void accept_new_row(const result_row_view& static_row, const result_row_view& row) {}
@@ -182,44 +187,78 @@ uint32_t result::calculate_row_count(const query::partition_slice& slice) {
} counter;
result_view::consume(*this, slice, counter);
return counter.total_count;
_row_count = counter.total_count;
_partition_count = counter.live_partitions;
}
result::result()
: result([] {
bytes_ostream out;
ser::writer_of_query_result(out).skip_partitions().end_query_result();
ser::writer_of_query_result<bytes_ostream>(out).skip_partitions().end_query_result();
return out;
}(), short_read::no)
}(), short_read::no, 0, 0)
{ }
static void write_partial_partition(ser::writer_of_qr_partition<bytes_ostream>&& pw, const ser::qr_partition_view& pv, uint32_t rows_to_include) {
auto key = pv.key();
auto static_cells_wr = (key ? std::move(pw).write_key(*key) : std::move(pw).skip_key())
.start_static_row()
.start_cells();
for (auto&& cell : pv.static_row().cells()) {
static_cells_wr.add(cell);
}
auto rows_wr = std::move(static_cells_wr)
.end_cells()
.end_static_row()
.start_rows();
auto rows = pv.rows();
// rows.size() can be 0 if there's a single static row
auto it = rows.begin();
for (uint32_t i = 0; i < std::min(rows.size(), uint64_t{rows_to_include}); ++i) {
rows_wr.add(*it++);
}
std::move(rows_wr).end_rows().end_qr_partition();
}
foreign_ptr<lw_shared_ptr<query::result>> result_merger::get() {
if (_partial.size() == 1) {
return std::move(_partial[0]);
}
bytes_ostream w;
auto partitions = ser::writer_of_query_result(w).start_partitions();
std::experimental::optional<uint32_t> row_count = 0;
auto partitions = ser::writer_of_query_result<bytes_ostream>(w).start_partitions();
uint32_t row_count = 0;
short_read is_short_read;
uint32_t partition_count = 0;
for (auto&& r : _partial) {
if (row_count) {
if (r->row_count()) {
row_count = row_count.value() + r->row_count().value();
} else {
row_count = std::experimental::nullopt;
}
}
result_view::do_with(*r, [&] (result_view rv) {
for (auto&& pv : rv._v.partitions()) {
partitions.add(pv);
auto rows = pv.rows();
// If rows.empty(), then there's a static row, or there wouldn't be a partition
const uint32_t rows_in_partition = rows.size() ? : 1;
const uint32_t rows_to_include = std::min(_max_rows - row_count, rows_in_partition);
row_count += rows_to_include;
if (rows_to_include >= rows_in_partition) {
partitions.add(pv);
if (++partition_count >= _max_partitions) {
return;
}
} else if (rows_to_include > 0) {
write_partial_partition(partitions.add(), pv, rows_to_include);
return;
} else {
return;
}
}
});
if (r->is_short_read()) {
is_short_read = short_read::yes;
break;
}
if (row_count >= _max_rows || partition_count >= _max_partitions) {
break;
}
}
std::move(partitions).end_partitions().end_query_result();
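The merging loop above, counting rows across partials and truncating the partition that crosses the limit, can be modelled with plain row counts. A sketch (hypothetical `merge_row_counts`, assuming each partial is a list of per-partition row counts and ignoring the partition limit and static rows):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Concatenates per-partition row counts from partial results until
// max_rows is reached; the partition that would cross the limit is
// truncated to the remaining budget, like write_partial_partition() above.
std::vector<uint32_t> merge_row_counts(
        const std::vector<std::vector<uint32_t>>& partials,
        uint32_t max_rows) {
    std::vector<uint32_t> merged;
    uint32_t total = 0;
    for (const auto& part : partials) {
        for (uint32_t rows_in_partition : part) {
            uint32_t to_include = std::min(max_rows - total, rows_in_partition);
            if (to_include == 0) {
                return merged;            // budget exhausted
            }
            merged.push_back(to_include); // possibly a truncated partition
            total += to_include;
            if (total >= max_rows) {
                return merged;
            }
        }
    }
    return merged;
}
```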


@@ -30,7 +30,14 @@ namespace query {
// Implements @Reducer concept from distributed.hh
class result_merger {
std::vector<foreign_ptr<lw_shared_ptr<query::result>>> _partial;
const uint32_t _max_rows;
const uint32_t _max_partitions;
public:
explicit result_merger(uint32_t max_rows, uint32_t max_partitions)
: _max_rows(max_rows)
, _max_partitions(max_partitions)
{ }
void reserve(size_t size) {
_partial.reserve(size);
}


@@ -30,6 +30,7 @@ import ConfigParser
import os
import sys
import subprocess
import uuid
from pkg_resources import parse_version
VERSION = "1.0"
@@ -64,6 +65,10 @@ def get_api(path):
def version_compare(a, b):
return parse_version(a) < parse_version(b)
def create_uuid_file(fl):
with open(args.uuid_file, 'w') as myfile:
myfile.write(str(uuid.uuid1()) + "\n")
def check_version(ar):
if config and (not config.has_option("housekeeping", "check-version") or not config.getboolean("housekeeping", "check-version")):
return
@@ -80,8 +85,8 @@ def check_version(ar):
# mode would accept any string.
# use i for install, c (default) for running from the command line
params = params + "&sts=" + ar.mode
if uuid:
params = params + "&uu=" + uuid
if uid:
params = params + "&uu=" + uid
latest_version = get_json_from_url(version_url + params)["version"]
except:
traceln("Unable to retrieve version information")
@@ -112,10 +117,12 @@ if args.config != "":
sys.exit(0)
config = ConfigParser.SafeConfigParser()
config.read(args.config)
uuid = None
uid = None
if args.uuid != "":
uuid = args.uuid
if args.uuid_file != "" and os.path.exists(args.uuid_file):
uid = args.uuid
if args.uuid_file != "":
if not os.path.exists(args.uuid_file):
create_uuid_file(args.uuid_file)
with open(args.uuid_file, 'r') as myfile:
uuid = myfile.read().replace('\n', '')
uid = myfile.read().replace('\n', '')
args.func(args)

Submodule seastar updated: 0b98024073...0bfd7fe517


@@ -27,8 +27,14 @@ namespace ser {
// frame represents a place holder for an object size which will be known later
template<typename Output>
struct place_holder { };
struct place_holder {
template<typename Output>
struct frame { };
template<>
struct place_holder<bytes_ostream> {
bytes_ostream::place_holder<size_type> ph;
place_holder(bytes_ostream::place_holder<size_type> ph) : ph(ph) { }
@@ -39,7 +45,8 @@ struct place_holder {
}
};
struct frame : public place_holder {
template<>
struct frame<bytes_ostream> : public place_holder<bytes_ostream> {
bytes_ostream::size_type offset;
frame(bytes_ostream::place_holder<size_type> ph, bytes_ostream::size_type offset)
@@ -56,25 +63,26 @@ struct vector_position {
};
// empty frame: behaves like a place holder, but is used when no place holder is needed
template<typename Output>
struct empty_frame {
void end(bytes_ostream&) {}
void end(Output&) {}
empty_frame() = default;
empty_frame(const frame&){}
empty_frame(const frame<Output>&){}
};
inline place_holder start_place_holder(bytes_ostream& out) {
inline place_holder<bytes_ostream> start_place_holder(bytes_ostream& out) {
auto size_ph = out.write_place_holder<size_type>();
return { size_ph};
}
inline frame start_frame(bytes_ostream& out) {
inline frame<bytes_ostream> start_frame(bytes_ostream& out) {
auto offset = out.size();
auto size_ph = out.write_place_holder<size_type>();
{
auto out = size_ph.get_stream();
serialize(out, (size_type)0);
}
return frame { size_ph, offset };
return frame<bytes_ostream> { size_ph, offset };
}
template<typename Input>
@@ -86,4 +94,25 @@ size_type read_frame_size(Input& in) {
return sz - sizeof(size_type);
}
template<>
struct place_holder<seastar::measuring_output_stream> {
void set(seastar::measuring_output_stream&, size_type) { }
};
template<>
struct frame<seastar::measuring_output_stream> : public place_holder<seastar::measuring_output_stream> {
void end(seastar::measuring_output_stream& out) { }
};
inline place_holder<seastar::measuring_output_stream> start_place_holder(seastar::measuring_output_stream& out) {
serialize(out, size_type());
return { };
}
inline frame<seastar::measuring_output_stream> start_frame(seastar::measuring_output_stream& out) {
serialize(out, size_type());
return { };
}
}
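The `measuring_output_stream` specializations above exist so the same templated serialization code can be instantiated with a stream that only counts bytes, letting callers size a row before deciding to emit it. A self-contained sketch of the idea (illustrative names, not Seastar's actual types):

```cpp
#include <cstddef>
#include <cstring>

// Counts bytes instead of writing them; place holders and frames for such
// a stream are no-ops, as in the specializations above.
struct measuring_stream {
    size_t _size = 0;
    void write(const char* /*data*/, size_t n) { _size += n; }
    size_t size() const { return _size; }
};

// The same templated writer works for both a real output stream and the
// measuring one; instantiating it with measuring_stream yields a dry run.
template <typename Output>
void write_blob(Output& out, const char* s) {
    out.write(s, std::strlen(s));
}
```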


@@ -64,7 +64,7 @@ public:
, _ranges(std::move(ranges))
{}
private:
private:
static bool has_clustering_keys(const schema& s, const query::read_command& cmd) {
return s.clustering_key_size() > 0
&& !cmd.slice.options.contains<query::partition_slice::option::distinct>();
@@ -229,94 +229,42 @@ private:
class myvisitor : public cql3::selection::result_set_builder::visitor {
public:
impl& _impl;
uint32_t page_size;
uint32_t part_rows = 0;
uint32_t included_rows = 0;
uint32_t total_rows = 0;
std::experimental::optional<partition_key> last_pkey;
std::experimental::optional<clustering_key> last_ckey;
// just for verbosity
uint32_t part_ignored = 0;
clustering_key::less_compare _less;
bool include_row() {
++total_rows;
++part_rows;
if (included_rows >= page_size) {
++part_ignored;
return false;
}
++included_rows;
return true;
}
bool include_row(const clustering_key& key) {
if (!include_row()) {
return false;
}
last_ckey = key;
return true;
}
myvisitor(impl& i, uint32_t ps,
cql3::selection::result_set_builder& builder,
myvisitor(cql3::selection::result_set_builder& builder,
const schema& s,
const cql3::selection::selection& selection)
: visitor(builder, s, selection), _impl(i), page_size(ps), _less(*_impl._schema) {
: visitor(builder, s, selection) {
}
void accept_new_partition(uint32_t) {
throw std::logic_error("Should not reach!");
}
void accept_new_partition(const partition_key& key, uint32_t row_count) {
logger.trace("Begin partition: {} ({})", key, row_count);
part_rows = 0;
part_ignored = 0;
if (included_rows < page_size) {
last_pkey = key;
last_ckey = { };
}
logger.trace("Accepting partition: {} ({})", key, row_count);
total_rows += row_count;
last_pkey = key;
last_ckey = { };
visitor::accept_new_partition(key, row_count);
}
void accept_new_row(const clustering_key& key,
const query::result_row_view& static_row,
const query::result_row_view& row) {
// TODO: should we use exception/long jump or introduce
// a "stop" condition to the calling result_view and
// avoid processing unneeded rows?
auto ok = include_row(key);
if (ok) {
visitor::accept_new_row(key, static_row, row);
}
last_ckey = key;
visitor::accept_new_row(key, static_row, row);
}
void accept_new_row(const query::result_row_view& static_row,
const query::result_row_view& row) {
auto ok = include_row();
if (ok) {
visitor::accept_new_row(static_row, row);
}
visitor::accept_new_row(static_row, row);
}
void accept_partition_end(const query::result_row_view& static_row) {
// accept_partition_end with row_count == 0
// means we had an empty partition but live
// static columns, and since the fix,
// no CK restrictions.
// I.e. _row_count == 0 -> add a partially empty row
// So, treat this case as an accept_row variant
if (_row_count > 0 || include_row()) {
visitor::accept_partition_end(static_row);
}
logger.trace(
"End partition, included={}, ignored={}",
part_rows - part_ignored,
part_ignored);
visitor::accept_partition_end(static_row);
}
};
myvisitor v(*this, std::min(page_size, _max), builder, *_schema, *_selection);
myvisitor v(builder, *_schema, *_selection);
query::result_view::consume(*results, _cmd->slice, v);
if (_last_pkey) {
@@ -328,13 +276,12 @@ private:
_cmd->slice.clear_range(*_schema, *_last_pkey);
}
_max = _max - v.included_rows;
_exhausted = (v.included_rows < page_size && !results->is_short_read()) || _max == 0;
_max = _max - v.total_rows;
_exhausted = (v.total_rows < page_size && !results->is_short_read()) || _max == 0;
_last_pkey = v.last_pkey;
_last_ckey = v.last_ckey;
logger.debug("Fetched {}/{} rows, max_remain={} {}", v.included_rows, v.total_rows,
_max, _exhausted ? "(exh)" : "");
logger.debug("Fetched {} rows, max_remain={} {}", v.total_rows, _max, _exhausted ? "(exh)" : "");
if (_last_pkey) {
logger.debug("Last partition key: {}", *_last_pkey);
@@ -363,7 +310,6 @@ private:
// remember if we use clustering. if not, each partition == one row
const bool _has_clustering_keys;
bool _exhausted = false;
uint32_t _rem = 0;
uint32_t _max;
std::experimental::optional<partition_key> _last_pkey;


@@ -2074,7 +2074,7 @@ public:
versions.reserve(_data_results.front().result->partitions().size());
for (auto& r : _data_results) {
_is_short_read = r.result->is_short_read();
_is_short_read = _is_short_read || r.result->is_short_read();
r.reached_end = !r.result->is_short_read() && r.result->row_count() < cmd.row_limit
&& (cmd.partition_limit == query::max_partitions
|| boost::range::count_if(r.result->partitions(), [] (const partition& p) {
@@ -2346,7 +2346,8 @@ protected:
if (rr_opt && (can_send_short_read || data_resolver->all_reached_end() || rr_opt->row_count() >= original_row_limit()
|| data_resolver->live_partition_count() >= original_partition_limit())
&& !data_resolver->any_partition_short_read()) {
auto result = ::make_foreign(::make_lw_shared(to_data_query_result(std::move(*rr_opt), _schema, _cmd->slice)));
auto result = ::make_foreign(::make_lw_shared(
to_data_query_result(std::move(*rr_opt), _schema, _cmd->slice, _cmd->row_limit, _cmd->partition_limit)));
// wait for the write to complete before returning the result, to prevent multiple concurrent read
// requests from triggering repair multiple times, and to prevent a quorum read from returning an old
// value after another quorum read has returned a newer one (when the newer value has not yet reached the other replicas)
@@ -2387,6 +2388,13 @@ protected:
_retry_cmd->row_limit = x(cmd->row_limit, data_resolver->total_live_count());
}
}
// We may be unable to send a single live row because of replicas bailing out too early.
// If that is the case disallow short reads so that we can make progress.
if (!data_resolver->total_live_count()) {
_retry_cmd->slice.options.remove<query::partition_slice::option::allow_short_read>();
}
logger.trace("Retrying query with command {} (previous is {})", *_retry_cmd, *cmd);
reconcile(cl, timeout, _retry_cmd);
}
@@ -2626,17 +2634,17 @@ db::read_repair_decision storage_proxy::new_read_repair_decision(const schema& s
}
future<query::result_digest, api::timestamp_type>
storage_proxy::query_singular_local_digest(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr, tracing::trace_state_ptr trace_state) {
return query_singular_local(std::move(s), std::move(cmd), pr, query::result_request::only_digest, std::move(trace_state)).then([] (foreign_ptr<lw_shared_ptr<query::result>> result) {
storage_proxy::query_singular_local_digest(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr, tracing::trace_state_ptr trace_state, uint64_t max_size) {
return query_singular_local(std::move(s), std::move(cmd), pr, query::result_request::only_digest, std::move(trace_state), max_size).then([] (foreign_ptr<lw_shared_ptr<query::result>> result) {
return make_ready_future<query::result_digest, api::timestamp_type>(*result->digest(), result->last_modified());
});
}
future<foreign_ptr<lw_shared_ptr<query::result>>>
storage_proxy::query_singular_local(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr, query::result_request request, tracing::trace_state_ptr trace_state) {
storage_proxy::query_singular_local(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr, query::result_request request, tracing::trace_state_ptr trace_state, uint64_t max_size) {
unsigned shard = _db.local().shard_of(pr.start()->value().token());
return _db.invoke_on(shard, [gs = global_schema_ptr(s), prv = std::vector<query::partition_range>({pr}) /* FIXME: pr is copied */, cmd, request, gt = tracing::global_trace_state_ptr(std::move(trace_state))] (database& db) mutable {
return db.query(gs, *cmd, request, prv, gt).then([](auto&& f) {
return _db.invoke_on(shard, [max_size, gs = global_schema_ptr(s), prv = std::vector<query::partition_range>({pr}) /* FIXME: pr is copied */, cmd, request, gt = tracing::global_trace_state_ptr(std::move(trace_state))] (database& db) mutable {
return db.query(gs, *cmd, request, prv, gt, max_size).then([](auto&& f) {
return make_foreign(std::move(f));
});
});
@@ -2670,7 +2678,7 @@ storage_proxy::query_singular(lw_shared_ptr<query::read_command> cmd, std::vecto
exec.push_back(get_read_executor(cmd, std::move(pr), cl, trace_state));
}
query::result_merger merger;
query::result_merger merger(cmd->row_limit, cmd->partition_limit);
merger.reserve(exec.size());
auto f = ::map_reduce(exec.begin(), exec.end(), [timeout] (::shared_ptr<abstract_read_executor>& rex) {
@@ -2687,7 +2695,8 @@ storage_proxy::query_singular(lw_shared_ptr<query::read_command> cmd, std::vecto
future<std::vector<foreign_ptr<lw_shared_ptr<query::result>>>>
storage_proxy::query_partition_key_range_concurrent(std::chrono::steady_clock::time_point timeout, std::vector<foreign_ptr<lw_shared_ptr<query::result>>>&& results,
lw_shared_ptr<query::read_command> cmd, db::consistency_level cl, std::vector<query::partition_range>::iterator&& i,
std::vector<query::partition_range>&& ranges, int concurrency_factor, tracing::trace_state_ptr trace_state, uint32_t total_row_count) {
std::vector<query::partition_range>&& ranges, int concurrency_factor, tracing::trace_state_ptr trace_state,
uint32_t remaining_row_count, uint32_t remaining_partition_count) {
schema_ptr schema = local_schema_registry().get(cmd->schema_version);
keyspace& ks = _db.local().find_keyspace(schema->ks_name());
std::vector<::shared_ptr<abstract_read_executor>> exec;
@@ -2752,22 +2761,30 @@ storage_proxy::query_partition_key_range_concurrent(std::chrono::steady_clock::t
exec.push_back(::make_shared<range_slice_read_executor>(schema, p, cmd, std::move(range), cl, std::move(filtered_endpoints), trace_state));
}
query::result_merger merger;
query::result_merger merger(cmd->row_limit, cmd->partition_limit);
merger.reserve(exec.size());
auto f = ::map_reduce(exec.begin(), exec.end(), [timeout] (::shared_ptr<abstract_read_executor>& rex) {
return rex->execute(timeout);
}, std::move(merger));
return f.then([p, exec = std::move(exec), results = std::move(results), i = std::move(i), ranges = std::move(ranges), cl, cmd, concurrency_factor, timeout, total_row_count, trace_state = std::move(trace_state)]
return f.then([p, exec = std::move(exec), results = std::move(results), i = std::move(i), ranges = std::move(ranges),
cl, cmd, concurrency_factor, timeout, remaining_row_count, remaining_partition_count, trace_state = std::move(trace_state)]
(foreign_ptr<lw_shared_ptr<query::result>>&& result) mutable {
total_row_count += result->row_count() ? result->row_count().value() :
(logger.error("no row count in query result, should not happen here"), result->calculate_row_count(cmd->slice));
if (!result->row_count() || !result->partition_count()) {
logger.error("no row count in query result, should not happen here");
result->calculate_counts(cmd->slice);
}
remaining_row_count -= result->row_count().value();
remaining_partition_count -= result->partition_count().value();
results.emplace_back(std::move(result));
if (i == ranges.end() || total_row_count >= cmd->row_limit) {
if (i == ranges.end() || !remaining_row_count || !remaining_partition_count) {
return make_ready_future<std::vector<foreign_ptr<lw_shared_ptr<query::result>>>>(std::move(results));
} else {
return p->query_partition_key_range_concurrent(timeout, std::move(results), cmd, cl, std::move(i), std::move(ranges), concurrency_factor, std::move(trace_state), total_row_count);
cmd->row_limit = remaining_row_count;
cmd->partition_limit = remaining_partition_count;
return p->query_partition_key_range_concurrent(timeout, std::move(results), cmd, cl, std::move(i),
std::move(ranges), concurrency_factor, std::move(trace_state), remaining_row_count, remaining_partition_count);
}
}).handle_exception([p] (std::exception_ptr eptr) {
p->handle_read_error(eptr, true);
@@ -2812,9 +2829,10 @@ storage_proxy::query_partition_key_range(lw_shared_ptr<query::read_command> cmd,
logger.debug("Estimated result rows per range: {}; requested rows: {}, ranges.size(): {}; concurrent range requests: {}",
result_rows_per_range, cmd->row_limit, ranges.size(), concurrency_factor);
return query_partition_key_range_concurrent(timeout, std::move(results), cmd, cl, ranges.begin(), std::move(ranges), concurrency_factor, std::move(trace_state))
.then([](std::vector<foreign_ptr<lw_shared_ptr<query::result>>> results) {
query::result_merger merger;
return query_partition_key_range_concurrent(timeout, std::move(results), cmd, cl, ranges.begin(), std::move(ranges), concurrency_factor,
std::move(trace_state), cmd->row_limit, cmd->partition_limit)
.then([row_limit = cmd->row_limit, partition_limit = cmd->partition_limit](std::vector<foreign_ptr<lw_shared_ptr<query::result>>> results) {
query::result_merger merger(row_limit, partition_limit);
merger.reserve(results.size());
for (auto&& r: results) {
@@ -2838,7 +2856,8 @@ storage_proxy::query(schema_ptr s,
logger.trace("query {}.{} cmd={}, ranges={}, id={}", s->ks_name(), s->cf_name(), *cmd, partition_ranges, query_id);
return do_query(s, cmd, std::move(partition_ranges), cl, std::move(trace_state)).then([query_id, cmd, s] (foreign_ptr<lw_shared_ptr<query::result>>&& res) {
if (res->buf().is_linearized()) {
logger.trace("query_result id={}, size={}, rows={}", query_id, res->buf().size(), res->calculate_row_count(cmd->slice));
res->calculate_counts(cmd->slice);
logger.trace("query_result id={}, size={}, rows={}, partitions={}", query_id, res->buf().size(), *res->row_count(), *res->partition_count());
} else {
logger.trace("query_result id={}, size={}", query_id, res->buf().size());
}
@@ -3447,9 +3466,10 @@ void storage_proxy::init_messaging_service() {
tracing::trace(trace_state_ptr, "read_data: message received from /{}", src_addr.addr);
}
auto da = oda.value_or(query::digest_algorithm::MD5);
return do_with(std::move(pr), get_local_shared_storage_proxy(), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), da] (compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
auto max_size = cinfo.retrieve_auxiliary<uint64_t>("max_result_size");
return do_with(std::move(pr), get_local_shared_storage_proxy(), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), da, max_size] (compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
auto src_ip = src_addr.addr;
return get_schema_for_read(cmd->schema_version, std::move(src_addr)).then([cmd, da, &pr, &p, &trace_state_ptr] (schema_ptr s) {
return get_schema_for_read(cmd->schema_version, std::move(src_addr)).then([cmd, da, &pr, &p, &trace_state_ptr, max_size] (schema_ptr s) {
auto pr2 = compat::unwrap(std::move(pr), *s);
if (pr2.second) {
// this function assumes singular queries but doesn't validate
@@ -3464,7 +3484,7 @@ void storage_proxy::init_messaging_service() {
qrr = query::result_request::result_and_digest;
break;
}
return p->query_singular_local(std::move(s), cmd, std::move(pr2.first), qrr, trace_state_ptr);
return p->query_singular_local(std::move(s), cmd, std::move(pr2.first), qrr, trace_state_ptr, max_size);
}).finally([&trace_state_ptr, src_ip] () mutable {
tracing::trace(trace_state_ptr, "read_data handling is done, sending a response to /{}", src_ip);
});
@@ -3478,10 +3498,20 @@ void storage_proxy::init_messaging_service() {
tracing::begin(trace_state_ptr);
tracing::trace(trace_state_ptr, "read_mutation_data: message received from /{}", src_addr.addr);
}
return do_with(std::move(pr), get_local_shared_storage_proxy(), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr)] (compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
auto max_size = cinfo.retrieve_auxiliary<uint64_t>("max_result_size");
return do_with(std::move(pr),
get_local_shared_storage_proxy(),
std::move(trace_state_ptr),
compat::one_or_two_partition_ranges({}),
[&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), max_size] (
compat::wrapping_partition_range& pr,
shared_ptr<storage_proxy>& p,
tracing::trace_state_ptr& trace_state_ptr,
compat::one_or_two_partition_ranges& unwrapped) mutable {
auto src_ip = src_addr.addr;
return get_schema_for_read(cmd->schema_version, std::move(src_addr)).then([cmd, &pr, &p, &trace_state_ptr] (schema_ptr s) mutable {
return p->query_mutations_locally(std::move(s), cmd, compat::unwrap(std::move(pr), *s), trace_state_ptr);
return get_schema_for_read(cmd->schema_version, std::move(src_addr)).then([cmd, &pr, &p, &trace_state_ptr, max_size, &unwrapped] (schema_ptr s) mutable {
unwrapped = compat::unwrap(std::move(pr), *s);
return p->query_mutations_locally(std::move(s), std::move(cmd), unwrapped, trace_state_ptr, max_size);
}).finally([&trace_state_ptr, src_ip] () mutable {
tracing::trace(trace_state_ptr, "read_mutation_data handling is done, sending a response to /{}", src_ip);
});
@@ -3495,15 +3525,16 @@ void storage_proxy::init_messaging_service() {
tracing::begin(trace_state_ptr);
tracing::trace(trace_state_ptr, "read_digest: message received from /{}", src_addr.addr);
}
return do_with(std::move(pr), get_local_shared_storage_proxy(), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr)] (compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
auto max_size = cinfo.retrieve_auxiliary<uint64_t>("max_result_size");
return do_with(std::move(pr), get_local_shared_storage_proxy(), std::move(trace_state_ptr), [&cinfo, cmd = make_lw_shared<query::read_command>(std::move(cmd)), src_addr = std::move(src_addr), max_size] (compat::wrapping_partition_range& pr, shared_ptr<storage_proxy>& p, tracing::trace_state_ptr& trace_state_ptr) mutable {
auto src_ip = src_addr.addr;
return get_schema_for_read(cmd->schema_version, std::move(src_addr)).then([cmd, &pr, &p, &trace_state_ptr] (schema_ptr s) {
return get_schema_for_read(cmd->schema_version, std::move(src_addr)).then([cmd, &pr, &p, &trace_state_ptr, max_size] (schema_ptr s) {
auto pr2 = compat::unwrap(std::move(pr), *s);
if (pr2.second) {
// this function assumes singular queries but doesn't validate
throw std::runtime_error("READ_DIGEST called with wrapping range");
}
return p->query_singular_local_digest(std::move(s), cmd, std::move(pr2.first), trace_state_ptr);
return p->query_singular_local_digest(std::move(s), cmd, std::move(pr2.first), trace_state_ptr, max_size);
}).finally([&trace_state_ptr, src_ip] () mutable {
tracing::trace(trace_state_ptr, "read_digest handling is done, sending a response to /{}", src_ip);
});
@@ -3539,41 +3570,166 @@ void storage_proxy::uninit_messaging_service() {
// Merges reconcilable_result:s from different shards into one
// Drops partitions which exceed the limit.
class mutation_result_merger {
schema_ptr _schema;
lw_shared_ptr<const query::read_command> _cmd;
unsigned _row_count = 0;
unsigned _partition_count = 0;
bool _short_read_allowed;
query::short_read _short_read;
std::vector<partition> _partitions;
// We get a batch of partitions each time, each tagged with a key.
// Partition batches must be maintained in key order; batches that
// share a key must be merged and sorted in decorated_key order.
struct partitions_batch {
std::vector<partition> partitions;
query::short_read short_read;
};
std::multimap<unsigned, partitions_batch> _partitions;
query::result_memory_accounter _memory_accounter;
stdx::optional<unsigned> _stop_after_key;
public:
explicit mutation_result_merger(const query::read_command& cmd)
: _short_read_allowed(cmd.slice.options.contains(query::partition_slice::option::allow_short_read))
{ }
explicit mutation_result_merger(schema_ptr schema, lw_shared_ptr<const query::read_command> cmd)
: _schema(std::move(schema))
, _cmd(std::move(cmd))
, _short_read_allowed(_cmd->slice.options.contains(query::partition_slice::option::allow_short_read)) {
}
query::result_memory_accounter& memory() {
return _memory_accounter;
}
const query::result_memory_accounter& memory() const {
return _memory_accounter;
}
void add_result(foreign_ptr<lw_shared_ptr<reconcilable_result>> partial_result) {
void add_result(unsigned key, foreign_ptr<lw_shared_ptr<reconcilable_result>> partial_result) {
if (_stop_after_key && key > *_stop_after_key) {
// A short result was added that goes before this one.
return;
}
std::vector<partition> partitions;
partitions.reserve(partial_result->partitions().size());
// Following three lines to simplify patch; can remove later
for (const partition& p : partial_result->partitions()) {
_partitions.push_back(p);
partitions.push_back(p);
_row_count += p._row_count;
_partition_count += p._row_count > 0;
}
_short_read = partial_result->is_short_read();
if (_memory_accounter.update_and_check(partial_result->memory_usage()) && _short_read_allowed) {
_short_read = query::short_read::yes;
_memory_accounter.update(partial_result->memory_usage());
if (partial_result->is_short_read()) {
_stop_after_key = key;
}
_partitions.emplace(key, partitions_batch { std::move(partitions), partial_result->is_short_read() });
}
reconcilable_result get() && {
return reconcilable_result(_row_count, std::move(_partitions), _short_read,
std::move(_memory_accounter).done());
auto unsorted = std::unordered_set<unsigned>();
struct partitions_and_last_key {
std::vector<partition> partitions;
stdx::optional<dht::decorated_key> last; // set if we had a short read
};
auto merged = std::map<unsigned, partitions_and_last_key>();
auto short_read = query::short_read(this->short_read());
// merge batches with equal keys, and note if we need to sort afterwards
for (auto&& key_value : _partitions) {
auto&& key = key_value.first;
if (_stop_after_key && key > *_stop_after_key) {
break;
}
auto&& batch = key_value.second;
auto&& dest = merged[key];
if (dest.partitions.empty()) {
dest.partitions = std::move(batch.partitions);
} else {
unsorted.insert(key);
std::move(batch.partitions.begin(), batch.partitions.end(), std::back_inserter(dest.partitions));
}
// In case of a short read we need to remove all partitions from the
// batch that come after the last partition of the short read
// result.
if (batch.short_read) {
// Nobody sends a short read with no data.
const auto& last = dest.partitions.back().mut().decorated_key(*_schema);
if (!dest.last || last.less_compare(*_schema, *dest.last)) {
dest.last = last;
}
short_read = query::short_read::yes;
}
}
// Sort batches that arrived with the same keys
for (auto key : unsorted) {
struct comparator {
const schema& s;
dht::decorated_key::less_comparator dkcmp;
bool operator()(const partition& a, const partition& b) const {
return dkcmp(a.mut().decorated_key(s), b.mut().decorated_key(s));
}
bool operator()(const dht::decorated_key& a, const partition& b) const {
return dkcmp(a, b.mut().decorated_key(s));
}
bool operator()(const partition& a, const dht::decorated_key& b) const {
return dkcmp(a.mut().decorated_key(s), b);
}
};
auto cmp = comparator { *_schema, dht::decorated_key::less_comparator(_schema) };
auto&& batch = merged[key];
boost::sort(batch.partitions, cmp);
if (batch.last) {
// This batch was built from a result that was a short read.
// We need to remove all partitions that are after that short
// read.
auto it = boost::range::upper_bound(batch.partitions, std::move(*batch.last), cmp);
batch.partitions.erase(it, batch.partitions.end());
}
}
auto final = std::vector<partition>();
final.reserve(_partition_count);
for (auto&& batch : merged | boost::adaptors::map_values) {
std::move(batch.partitions.begin(), batch.partitions.end(), std::back_inserter(final));
}
if (short_read) {
// Short read row and partition counts may be incorrect, recalculate.
_row_count = 0;
_partition_count = 0;
for (const auto& p : final) {
_row_count += p.row_count();
_partition_count += p.row_count() > 0;
}
if (_row_count >= _cmd->row_limit || _partition_count > _cmd->partition_limit) {
// Even though there was a short read contributing to the final
// result we got limited by total row limit or partition limit.
// Note that there is no trivial check that would let us clear the
// short read flag when _partition_count == _cmd->partition_limit,
// since the short read may have caused the last partition to contain
// fewer rows than asked for.
short_read = query::short_read::no;
}
}
// Trim back partition count and row count in case we overshot.
// Should be rare for dense tables.
while ((_partition_count > _cmd->partition_limit)
|| (_partition_count && (_row_count - final.back().row_count() >= _cmd->row_limit))) {
_row_count -= final.back().row_count();
_partition_count -= final.back().row_count() > 0;
final.pop_back();
}
if (_row_count > _cmd->row_limit) {
auto mut = final.back().mut().unfreeze(_schema);
static const auto all = std::vector<query::clustering_range>({query::clustering_range::make_open_ended_both_sides()});
auto is_reversed = _cmd->slice.options.contains(query::partition_slice::option::reversed);
auto final_rows = _cmd->row_limit - (_row_count - final.back().row_count());
_row_count -= final.back().row_count();
auto rc = mut.partition().compact_for_query(*_schema, _cmd->timestamp, all, is_reversed, final_rows);
final.back() = partition(rc, freeze(mut));
_row_count += rc;
}
return reconcilable_result(_row_count, std::move(final), short_read, std::move(_memory_accounter).done());
}
bool short_read() const {
return bool(_short_read);
return bool(_stop_after_key) || (_short_read_allowed && _row_count > 0 && _memory_accounter.check());
}
unsigned partition_count() const {
return _partition_count;
@@ -3584,65 +3740,159 @@ public:
};
future<foreign_ptr<lw_shared_ptr<reconcilable_result>>>
storage_proxy::query_mutations_locally(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr, tracing::trace_state_ptr trace_state) {
storage_proxy::query_mutations_locally(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr,
tracing::trace_state_ptr trace_state, uint64_t max_size) {
if (pr.is_singular()) {
unsigned shard = _db.local().shard_of(pr.start()->value().token());
return _db.invoke_on(shard, [cmd, &pr, gs=global_schema_ptr(s), gt = tracing::global_trace_state_ptr(std::move(trace_state))] (database& db) mutable {
return db.get_result_memory_limiter().new_read().then([&] (query::result_memory_accounter ma) {
return _db.invoke_on(shard, [max_size, cmd, &pr, gs=global_schema_ptr(s), gt = tracing::global_trace_state_ptr(std::move(trace_state))] (database& db) mutable {
return db.get_result_memory_limiter().new_mutation_read(max_size).then([&] (query::result_memory_accounter ma) {
return db.query_mutations(gs, *cmd, pr, std::move(ma), gt).then([] (reconcilable_result&& result) {
return make_foreign(make_lw_shared(std::move(result)));
});
});
});
} else {
return query_nonsingular_mutations_locally(std::move(s), std::move(cmd), {pr}, std::move(trace_state));
return query_nonsingular_mutations_locally(std::move(s), std::move(cmd), {pr}, std::move(trace_state), max_size);
}
}
future<foreign_ptr<lw_shared_ptr<reconcilable_result>>>
storage_proxy::query_mutations_locally(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const compat::one_or_two_partition_ranges& pr, tracing::trace_state_ptr trace_state) {
storage_proxy::query_mutations_locally(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const compat::one_or_two_partition_ranges& pr,
tracing::trace_state_ptr trace_state, uint64_t max_size) {
if (!pr.second) {
return query_mutations_locally(std::move(s), std::move(cmd), pr.first, std::move(trace_state));
return query_mutations_locally(std::move(s), std::move(cmd), pr.first, std::move(trace_state), max_size);
} else {
return query_nonsingular_mutations_locally(std::move(s), std::move(cmd), pr, std::move(trace_state));
return query_nonsingular_mutations_locally(std::move(s), std::move(cmd), pr, std::move(trace_state), max_size);
}
}
}
namespace {
struct element_and_shard {
unsigned element; // element in a partition range vector
unsigned shard;
};
bool operator==(element_and_shard a, element_and_shard b) {
return a.element == b.element && a.shard == b.shard;
}
}
namespace std {
template <>
struct hash<element_and_shard> {
size_t operator()(element_and_shard es) const {
return es.element * 31 + es.shard;
}
};
}
namespace service {
struct partition_range_and_sort_key {
query::partition_range pr;
unsigned sort_key_shard_order; // for the same source partition range, we sort in shard order
};
future<foreign_ptr<lw_shared_ptr<reconcilable_result>>>
storage_proxy::query_nonsingular_mutations_locally(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const std::vector<query::partition_range>& prs, tracing::trace_state_ptr trace_state) {
storage_proxy::query_nonsingular_mutations_locally(schema_ptr s, lw_shared_ptr<query::read_command> cmd, const std::vector<query::partition_range>& prs,
tracing::trace_state_ptr trace_state, uint64_t max_size) {
// no one permitted us to modify *cmd, so make a copy
auto shard_cmd = make_lw_shared<query::read_command>(*cmd);
return do_with(cmd,
shard_cmd,
mutation_result_merger{*cmd},
1u,
0u,
false,
static_cast<unsigned>(prs.size()),
std::unordered_map<element_and_shard, partition_range_and_sort_key>{},
mutation_result_merger{s, cmd},
dht::ring_position_range_vector_sharder{prs},
global_schema_ptr(s),
tracing::global_trace_state_ptr(std::move(trace_state)),
[this, s] (lw_shared_ptr<query::read_command>& cmd,
[this, s, max_size] (lw_shared_ptr<query::read_command>& cmd,
lw_shared_ptr<query::read_command>& shard_cmd,
unsigned& shards_in_parallel,
unsigned& mutation_result_merger_key,
bool& no_more_ranges,
unsigned& partition_range_count,
std::unordered_map<element_and_shard, partition_range_and_sort_key>& shards_for_this_iteration,
mutation_result_merger& mrm,
dht::ring_position_range_vector_sharder& rprs,
global_schema_ptr& gs,
tracing::global_trace_state_ptr& gt) {
return _db.local().get_result_memory_limiter().new_read().then([&, s] (query::result_memory_accounter ma) {
return _db.local().get_result_memory_limiter().new_mutation_read(max_size).then([&, s] (query::result_memory_accounter ma) {
mrm.memory() = std::move(ma);
return repeat_until_value([&, s] () -> future<stdx::optional<reconcilable_result>> {
auto now = rprs.next(*s);
if (!now) {
return make_ready_future<stdx::optional<reconcilable_result>>(std::move(mrm).get());
// We don't want to query a sparsely populated table sequentially, because the latency
// will go through the roof. We don't want to query a densely populated table in parallel,
// because we'll throw away most of the results. So we exponentially increase
// concurrency starting at 1: we waste no work on dense tables, and pay at most a
// `log(nr_shards) + ignore_msb_bits` latency multiplier for near-empty tables.
shards_for_this_iteration.clear();
// If we're reading from less than smp::count shards, then we can just append
// each shard in order without sorting. If we're reading from more, then
// we'll read from some shards at least twice, so the partitions within will be
// out-of-order wrt. other shards
auto retain_shard_order = true;
for (auto i = 0u; i < shards_in_parallel; ++i) {
auto now = rprs.next(*s);
if (!now) {
no_more_ranges = true;
break;
}
// Let's see if this is a new shard, or if we can expand an existing range
auto&& rng_ok = shards_for_this_iteration.emplace(element_and_shard{now->element, now->shard}, partition_range_and_sort_key{now->ring_range, i});
if (!rng_ok.second) {
// We saw this shard already, enlarge the range (we know now->ring_range came from the same partition range;
// otherwise it would have had a unique now->element).
auto& rng = rng_ok.first->second.pr;
rng = nonwrapping_range<dht::ring_position>(std::move(rng.start()), std::move(now->ring_range.end()));
// This range is no longer ordered with respect to the others, so:
retain_shard_order = false;
}
}
auto key_base = mutation_result_merger_key;
// prepare for next iteration
// Each iteration uses a merger key that is either i in the loop above (so in the range [0, shards_in_parallel),
// or, the element index in prs (so in the range [0, partition_range_count). Make room for sufficient keys.
mutation_result_merger_key += std::max(shards_in_parallel, partition_range_count);
shards_in_parallel *= 2;
shard_cmd->partition_limit = cmd->partition_limit - mrm.partition_count();
shard_cmd->row_limit = cmd->row_limit - mrm.row_count();
return _db.invoke_on(now->shard, [&, now = std::move(*now), gt] (database& db) {
query::result_memory_accounter accounter(db.get_result_memory_limiter(), mrm.memory());
return db.query_mutations(gs, *shard_cmd, now.ring_range, std::move(accounter), std::move(gt)).then([] (reconcilable_result&& rr) {
return make_foreign(make_lw_shared(std::move(rr)));
return parallel_for_each(shards_for_this_iteration, [&, key_base, retain_shard_order] (const std::pair<const element_and_shard, partition_range_and_sort_key>& elem_shard_range) {
auto&& elem = elem_shard_range.first.element;
auto&& shard = elem_shard_range.first.shard;
auto&& range = elem_shard_range.second.pr;
auto sort_key_shard_order = elem_shard_range.second.sort_key_shard_order;
return _db.invoke_on(shard, [&, range, gt, fstate = mrm.memory().state_for_another_shard()] (database& db) {
query::result_memory_accounter accounter(db.get_result_memory_limiter(), std::move(fstate));
return db.query_mutations(gs, *shard_cmd, range, std::move(accounter), std::move(gt)).then([] (reconcilable_result&& rr) {
return make_foreign(make_lw_shared(std::move(rr)));
});
}).then([&, key_base, retain_shard_order, elem, sort_key_shard_order] (foreign_ptr<lw_shared_ptr<reconcilable_result>> partial_result) {
// Each outer (sequential) iteration is in result order, so we pick increasing keys.
// Within the inner (parallel) iteration, the results can be in order (if retain_shard_order), or not (if !retain_shard_order).
// If the results are unordered, we still have to order them according to which element of prs they originated from.
auto key = key_base; // for outer loop
if (retain_shard_order) {
key += sort_key_shard_order; // inner loop is ordered
} else {
key += elem; // inner loop ordered only by position within prs
}
mrm.add_result(key, std::move(partial_result));
});
}).then([&] (foreign_ptr<lw_shared_ptr<reconcilable_result>> rr) -> stdx::optional<reconcilable_result> {
mrm.add_result(std::move(rr));
if (mrm.short_read() || mrm.partition_count() >= cmd->partition_limit || mrm.row_count() >= cmd->row_limit) {
return std::move(mrm).get();
}).then([&] () -> stdx::optional<reconcilable_result> {
if (mrm.short_read() || mrm.partition_count() >= cmd->partition_limit || mrm.row_count() >= cmd->row_limit || no_more_ranges) {
return stdx::make_optional(std::move(mrm).get());
}
return stdx::nullopt;
});


@@ -235,15 +235,18 @@ private:
 ::shared_ptr<abstract_read_executor> get_read_executor(lw_shared_ptr<query::read_command> cmd, query::partition_range pr, db::consistency_level cl, tracing::trace_state_ptr trace_state);
 future<foreign_ptr<lw_shared_ptr<query::result>>> query_singular_local(schema_ptr, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr,
 query::result_request request,
-tracing::trace_state_ptr trace_state);
-future<query::result_digest, api::timestamp_type> query_singular_local_digest(schema_ptr, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr, tracing::trace_state_ptr trace_state);
-future<foreign_ptr<lw_shared_ptr<query::result>>> query_partition_key_range(lw_shared_ptr<query::read_command> cmd, std::vector<query::partition_range> partition_ranges, db::consistency_level cl, tracing::trace_state_ptr trace_state);
+tracing::trace_state_ptr trace_state,
+uint64_t max_size = query::result_memory_limiter::maximum_result_size);
+future<query::result_digest, api::timestamp_type> query_singular_local_digest(schema_ptr, lw_shared_ptr<query::read_command> cmd, const query::partition_range& pr, tracing::trace_state_ptr trace_state,
+uint64_t max_size = query::result_memory_limiter::maximum_result_size);
+future<foreign_ptr<lw_shared_ptr<query::result>>> query_partition_key_range(lw_shared_ptr<query::read_command> cmd, const std::vector<query::partition_range> partition_ranges, db::consistency_level cl, tracing::trace_state_ptr trace_state);
 std::vector<query::partition_range> get_restricted_ranges(keyspace& ks, const schema& s, query::partition_range range);
 float estimate_result_rows_per_range(lw_shared_ptr<query::read_command> cmd, keyspace& ks);
 static std::vector<gms::inet_address> intersection(const std::vector<gms::inet_address>& l1, const std::vector<gms::inet_address>& l2);
 future<std::vector<foreign_ptr<lw_shared_ptr<query::result>>>> query_partition_key_range_concurrent(std::chrono::steady_clock::time_point timeout,
 std::vector<foreign_ptr<lw_shared_ptr<query::result>>>&& results, lw_shared_ptr<query::read_command> cmd, db::consistency_level cl, std::vector<query::partition_range>::iterator&& i,
-std::vector<query::partition_range>&& ranges, int concurrency_factor, tracing::trace_state_ptr trace_state, uint32_t total_row_count = 0);
+std::vector<query::partition_range>&& ranges, int concurrency_factor, tracing::trace_state_ptr trace_state,
+uint32_t remaining_row_count, uint32_t remaining_partition_count);
 future<foreign_ptr<lw_shared_ptr<query::result>>> do_query(schema_ptr,
 lw_shared_ptr<query::read_command> cmd,
@@ -262,7 +265,7 @@ private:
 template<typename Range>
 future<> mutate_internal(Range mutations, db::consistency_level cl, tracing::trace_state_ptr tr_state);
 future<foreign_ptr<lw_shared_ptr<reconcilable_result>>> query_nonsingular_mutations_locally(
-schema_ptr s, lw_shared_ptr<query::read_command> cmd, const std::vector<query::partition_range>& pr, tracing::trace_state_ptr trace_state);
+schema_ptr s, lw_shared_ptr<query::read_command> cmd, const std::vector<query::partition_range>& pr, tracing::trace_state_ptr trace_state, uint64_t max_size);
 public:
 storage_proxy(distributed<database>& db);
@@ -337,16 +340,19 @@ public:
 future<foreign_ptr<lw_shared_ptr<reconcilable_result>>> query_mutations_locally(
 schema_ptr, lw_shared_ptr<query::read_command> cmd, const query::partition_range&,
-tracing::trace_state_ptr trace_state = nullptr);
+tracing::trace_state_ptr trace_state = nullptr,
+uint64_t max_size = query::result_memory_limiter::maximum_result_size);
 future<foreign_ptr<lw_shared_ptr<reconcilable_result>>> query_mutations_locally(
 schema_ptr, lw_shared_ptr<query::read_command> cmd, const compat::one_or_two_partition_ranges&,
-tracing::trace_state_ptr trace_state = nullptr);
+tracing::trace_state_ptr trace_state = nullptr,
+uint64_t max_size = query::result_memory_limiter::maximum_result_size);
 future<foreign_ptr<lw_shared_ptr<reconcilable_result>>> query_mutations_locally(
 schema_ptr s, lw_shared_ptr<query::read_command> cmd, const std::vector<query::partition_range>& pr,
-tracing::trace_state_ptr trace_state = nullptr);
+tracing::trace_state_ptr trace_state = nullptr,
+uint64_t max_size = query::result_memory_limiter::maximum_result_size);
 future<> stop();
@@ -1703,7 +1703,7 @@ file_writer components_writer::index_file_writer(sstable& sst, const io_priority
 options.buffer_size = sst.sstable_buffer_size;
 options.io_priority_class = pc;
 options.write_behind = 10;
-return file_writer(sst._index_file, std::move(options));
+return file_writer(std::move(sst._index_file), std::move(options));
 }
 // Get the currently loaded configuration, or the default configuration in
@@ -1855,7 +1855,6 @@ void components_writer::consume_end_of_stream() {
 seal_summary(_sst._summary, std::move(_first_key), std::move(_last_key)); // what if there is only one partition? what if it is empty?
 _index.close().get();
-_sst._index_file = file(); // index->close() closed _index_file
 if (_sst.has_component(sstable::component_type::CompressionInfo)) {
 _sst._collector.add_compression_ratio(_sst._compression.compressed_file_length(), _sst._compression.uncompressed_file_length());
@@ -1905,17 +1904,16 @@ void sstable_writer::prepare_file_writer()
 options.write_behind = 10;
 if (!_compression_enabled) {
-_writer = make_shared<checksummed_file_writer>(_sst._data_file, std::move(options), true);
+_writer = make_shared<checksummed_file_writer>(std::move(_sst._data_file), std::move(options), true);
 } else {
 prepare_compression(_sst._compression, _schema);
-_writer = make_shared<file_writer>(make_compressed_file_output_stream(_sst._data_file, std::move(options), &_sst._compression));
+_writer = make_shared<file_writer>(make_compressed_file_output_stream(std::move(_sst._data_file), std::move(options), &_sst._compression));
 }
 }
 void sstable_writer::finish_file_writer()
 {
 _writer->close().get();
-_sst._data_file = file(); // w->close() closed _data_file
 if (!_compression_enabled) {
 auto chksum_wr = static_pointer_cast<checksummed_file_writer>(_writer);
@@ -153,7 +153,12 @@ struct summary_ka {
 * Similar to origin off heap size
 */
 uint64_t memory_footprint() const {
-return sizeof(summary_entry) * entries.size() + sizeof(uint32_t) * positions.size() + sizeof(*this);
+auto sz = sizeof(summary_entry) * entries.size() + sizeof(uint32_t) * positions.size() + sizeof(*this);
+sz += first_key.value.size() + last_key.value.size();
+for (auto& e : entries) {
+sz += e.key.size();
+}
+return sz;
 }
 explicit operator bool() const {
@@ -56,23 +56,24 @@ SEASTAR_TEST_CASE(test_querying_with_limits) {
 pranges.emplace_back(query::partition_range::make_singular(dht::global_partitioner().decorate_key(*s, std::move(pkey))));
 }
+auto max_size = std::numeric_limits<size_t>::max();
 {
 auto cmd = query::read_command(s->id(), s->version(), partition_slice_builder(*s).build(), 3);
-auto result = db.query(s, cmd, query::result_request::only_result, pranges, nullptr).get0();
+auto result = db.query(s, cmd, query::result_request::only_result, pranges, nullptr, max_size).get0();
 assert_that(query::result_set::from_raw_result(s, cmd.slice, *result)).has_size(3);
 }
 {
 auto cmd = query::read_command(s->id(), s->version(), partition_slice_builder(*s).build(),
 query::max_rows, gc_clock::now(), std::experimental::nullopt, 5);
-auto result = db.query(s, cmd, query::result_request::only_result, pranges, nullptr).get0();
+auto result = db.query(s, cmd, query::result_request::only_result, pranges, nullptr, max_size).get0();
 assert_that(query::result_set::from_raw_result(s, cmd.slice, *result)).has_size(5);
 }
 {
 auto cmd = query::read_command(s->id(), s->version(), partition_slice_builder(*s).build(),
 query::max_rows, gc_clock::now(), std::experimental::nullopt, 3);
-auto result = db.query(s, cmd, query::result_request::only_result, pranges, nullptr).get0();
+auto result = db.query(s, cmd, query::result_request::only_result, pranges, nullptr, max_size).get0();
 assert_that(query::result_set::from_raw_result(s, cmd.slice, *result)).has_size(3);
 }
 });
@@ -131,7 +131,7 @@ BOOST_AUTO_TEST_CASE(test_simple_compound)
 BOOST_REQUIRE_EQUAL(buf1.size(), 12);
 bytes_ostream buf2;
-ser::writer_of_writable_simple_compound wowsc(buf2);
+ser::writer_of_writable_simple_compound<bytes_ostream> wowsc(buf2);
 std::move(wowsc).write_foo(sc.foo).write_bar(sc.bar).end_writable_simple_compound();
 BOOST_REQUIRE_EQUAL(buf1.linearize(), buf2.linearize());
@@ -170,7 +170,7 @@ BOOST_AUTO_TEST_CASE(test_vector)
 BOOST_REQUIRE_EQUAL(buf1.size(), 136);
 bytes_ostream buf2;
-ser::writer_of_writable_vectors_of_compounds wowvoc(buf2);
+ser::writer_of_writable_vectors_of_compounds<bytes_ostream> wowvoc(buf2);
 auto first_writer = std::move(wowvoc).start_first();
 for (auto& c : vec1) {
 first_writer.add().write_foo(c.foo).write_bar(c.bar).end_writable_simple_compound();
@@ -221,7 +221,7 @@ BOOST_AUTO_TEST_CASE(test_variant)
 simple_compound sc2 = { 0x12344321, 0x56788765 };
 bytes_ostream buf;
-ser::writer_of_writable_variants wowv(buf);
+ser::writer_of_writable_variants<bytes_ostream> wowv(buf);
 auto second_writer = std::move(wowv).write_id(17).write_first_simple_compound(sc).start_second_writable_vector().start_vector();
 for (auto&& v : vec) {
 second_writer.add_vector(v);
@@ -76,8 +76,10 @@ static query::partition_slice make_full_slice(const schema& s) {
 return partition_slice_builder(s).build();
 }
+static auto inf32 = std::numeric_limits<unsigned>::max();
 query::result_set to_result_set(const reconcilable_result& r, schema_ptr s, const query::partition_slice& slice) {
-return query::result_set::from_raw_result(s, slice, to_data_query_result(r, s, slice));
+return query::result_set::from_raw_result(s, slice, to_data_query_result(r, s, slice, inf32, inf32));
 }
 SEASTAR_TEST_CASE(test_reading_from_single_partition) {
@@ -460,25 +462,24 @@ SEASTAR_TEST_CASE(test_result_row_count) {
-auto src = make_source({m1});
-auto r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice);
+auto r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice, inf32, inf32);
 BOOST_REQUIRE_EQUAL(r.row_count().value(), 0);
 m1.set_static_cell("s1", data_value(bytes("S_v1")), 1);
-r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice);
+r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice, inf32, inf32);
 BOOST_REQUIRE_EQUAL(r.row_count().value(), 1);
 m1.set_clustered_cell(clustering_key::from_single_value(*s, bytes("A")), "v1", data_value(bytes("A_v1")), 1);
-r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice);
+r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice, inf32, inf32);
 BOOST_REQUIRE_EQUAL(r.row_count().value(), 1);
 m1.set_clustered_cell(clustering_key::from_single_value(*s, bytes("B")), "v1", data_value(bytes("B_v1")), 1);
-r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice);
+r = to_data_query_result(mutation_query(s, make_source({m1}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice, inf32, inf32);
 BOOST_REQUIRE_EQUAL(r.row_count().value(), 2);
 mutation m2(partition_key::from_single_value(*s, "key2"), s);
 m2.set_static_cell("s1", data_value(bytes("S_v1")), 1);
-r = to_data_query_result(mutation_query(s, make_source({m1, m2}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice);
+r = to_data_query_result(mutation_query(s, make_source({m1, m2}), query::full_partition_range, slice, 10000, query::max_partitions, now).get0(), s, slice, inf32, inf32);
 BOOST_REQUIRE_EQUAL(r.row_count().value(), 3);
 });
 }
@@ -118,11 +118,27 @@ future<> trace_keyspace_helper::setup_table(const sstring& name, const sstring&
 return make_ready_future<>();
 }
+::shared_ptr<cql3::statements::raw::cf_statement> parsed = static_pointer_cast<
+cql3::statements::raw::cf_statement>(cql3::query_processor::parse_statement(cql));
+parsed->prepare_keyspace(KEYSPACE_NAME);
+::shared_ptr<cql3::statements::create_table_statement> statement =
+static_pointer_cast<cql3::statements::create_table_statement>(
+parsed->prepare(db, qp.get_cql_stats())->statement);
+auto schema = statement->get_cf_meta_data();
+// Generate the CF UUID based on its KF names. This is needed to ensure that
+// all Nodes that create it would create it with the same UUID and we don't
+// hit the #420 issue.
+auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
+schema_builder b(schema);
+b.set_uuid(uuid);
 // We don't care it it fails really - this may happen due to concurrent
 // "CREATE TABLE" invocation on different Nodes.
 // The important thing is that it will converge eventually (some traces may
 // be lost in a process but that's ok).
-return qp.process(cql, db::consistency_level::ONE).discard_result().handle_exception([this] (auto ep) {});
+return service::get_local_migration_manager().announce_new_column_family(b.build(), false).discard_result().handle_exception([this] (auto ep) {});;
 }
 bool trace_keyspace_helper::cache_sessions_table_handles(const schema_ptr& schema) {
@@ -55,9 +55,10 @@ std::vector<sstring> trace_type_names = {
 "REPAIR"
 };
-tracing::tracing(const sstring& tracing_backend_helper_class_name)
+tracing::tracing(sstring tracing_backend_helper_class_name)
 : _write_timer([this] { write_timer_callback(); })
 , _thread_name(seastar::format("shard {:d}", engine().cpu_id()))
+, _tracing_backend_helper_class_name(std::move(tracing_backend_helper_class_name))
 , _registrations{
 scollectd::add_polled_metric(scollectd::type_instance_id("tracing"
 , scollectd::per_cpu_plugin_instance
@@ -93,27 +94,23 @@ tracing::tracing(const sstring& tracing_backend_helper_class_name)
 , scollectd::make_typed(scollectd::data_type::GAUGE, _flushing_records))}
 , _gen(std::random_device()())
 , _slow_query_duration_threshold(default_slow_query_duraion_threshold)
-, _slow_query_record_ttl(default_slow_query_record_ttl) {
-try {
-_tracing_backend_helper_ptr = create_object<i_tracing_backend_helper>(tracing_backend_helper_class_name, *this);
-} catch (no_such_class& e) {
-tracing_logger.error("Can't create tracing backend helper {}: not supported", tracing_backend_helper_class_name);
-throw;
-} catch (...) {
-throw;
-}
+, _slow_query_record_ttl(default_slow_query_record_ttl) {}
-future<> tracing::create_tracing(const sstring& tracing_backend_class_name) {
-return tracing_instance().start(tracing_backend_class_name).then([] {
-return tracing_instance().invoke_on_all([] (tracing& local_tracing) {
-return local_tracing.start();
-});
-});
+future<> tracing::create_tracing(sstring tracing_backend_class_name) {
+return tracing_instance().start(std::move(tracing_backend_class_name));
+}
+future<> tracing::start_tracing() {
+return tracing_instance().invoke_on_all([] (tracing& local_tracing) {
+return local_tracing.start();
+});
 }
 trace_state_ptr tracing::create_session(trace_type type, trace_state_props_set props) {
-trace_state_ptr tstate;
+if (!started()) {
+return trace_state_ptr();
+}
 try {
 // Don't create a session if its records are likely to be dropped
 if (!may_create_new_session()) {
@@ -129,6 +126,10 @@ trace_state_ptr tracing::create_session(trace_type type, trace_state_props_set p
 }
 trace_state_ptr tracing::create_session(const trace_info& secondary_session_info) {
+if (!started()) {
+return trace_state_ptr();
+}
 try {
 // Don't create a session if its records are likely to be dropped
 if (!may_create_new_session(secondary_session_info.session_id)) {
@@ -144,7 +145,17 @@ trace_state_ptr tracing::create_session(const trace_info& secondary_session_info
 }
 future<> tracing::start() {
+try {
+_tracing_backend_helper_ptr = create_object<i_tracing_backend_helper>(_tracing_backend_helper_class_name, *this);
+} catch (no_such_class& e) {
+tracing_logger.error("Can't create tracing backend helper {}: not supported", _tracing_backend_helper_class_name);
+throw;
+} catch (...) {
+throw;
+}
 return _tracing_backend_helper_ptr->start().then([this] {
 _down = false;
 _write_timer.arm(write_period);
 });
 }
@@ -43,6 +43,7 @@
 #include <vector>
 #include <atomic>
 #include <random>
+#include <seastar/core/scollectd.hh>
 #include <seastar/core/sharded.hh>
 #include <seastar/core/sstring.hh>
 #include "gc_clock.hh"
@@ -345,10 +346,15 @@ private:
 records_bulk _pending_for_write_records_bulk;
 timer<lowres_clock> _write_timer;
-bool _down = false;
+// _down becomes FALSE after the local service is fully initialized and
+// tracing records are allowed to be created and collected. It becomes TRUE
+// after the shutdown() call and prevents further write attempts to I/O
+// backend.
+bool _down = true;
 bool _slow_query_logging_enabled = false;
 std::unique_ptr<i_tracing_backend_helper> _tracing_backend_helper_ptr;
 sstring _thread_name;
+sstring _tracing_backend_helper_class_name;
 scollectd::registrations _registrations;
 double _trace_probability = 0.0; // keep this one for querying purposes
 uint64_t _normalized_trace_probability = 0;
@@ -376,8 +382,13 @@ public:
 return tracing_instance().local();
 }
-static future<> create_tracing(const sstring& tracing_backend_helper_class_name);
-tracing(const sstring& tracing_backend_helper_class_name);
+bool started() const {
+return !_down;
+}
+static future<> create_tracing(sstring tracing_backend_helper_class_name);
+static future<> start_tracing();
+tracing(sstring tracing_backend_helper_class_name);
 // Initialize a tracing backend (e.g. tracing_keyspace or logstash)
 future<> start();