Compare commits

1015 Commits

Author SHA1 Message Date
Avi Kivity
19907fad15 sstables: fix use-after-free in read_simple()
`r` is moved-from, and later captured in a different lambda. The compiler may
choose to move and perform the other capture later, resulting in a use-after-free.

Fix by copying `r` instead of moving it.

Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>

(cherry picked from commit 07b8adce0e)
2018-02-01 14:28:59 +01:00
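The hazard described in 19907fad15 can be illustrated with a minimal sketch. The `reader` type and `make_callbacks()` below are hypothetical stand-ins, not the actual `read_simple()` code; the point is only that copying a shared handle into every capture keeps it valid regardless of the order in which the captures are evaluated.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the state that two continuations share.
struct reader {
    std::string name;
};

// Builds two callbacks that both need `r`. Each lambda takes its own copy of
// the shared_ptr, so neither capture can observe a moved-from pointer.
// Moving `r` into one lambda and capturing it in the other would leave the
// second capture dependent on evaluation order.
std::pair<std::function<std::string()>, std::function<std::string()>>
make_callbacks(std::shared_ptr<reader> r) {
    assert(r != nullptr);
    auto first = [r] { return r->name + ":first"; };    // copy, not std::move(r)
    auto second = [r] { return r->name + ":second"; };  // r is still valid here
    return {first, second};
}
```

The extra reference-count increment is the cost of the copy; in exchange, both callbacks stay valid independently of each other.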
Avi Kivity
97f781c4d8 Update seastar submodule
* seastar e23b9b8...a66e0c5 (3):
  > posix.hh: add missing include
  > tls_test: Fix echo test not setting server trust store
  > tls: Actually verify client certificate if requested

Fixes #3072
2018-01-29 15:26:24 +02:00
Avi Kivity
88e69701bd Merge "Fix memory leak on zone reclaim" from Tomek
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120"

* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
  tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
  lsa: Expose max_zone_segments for tests
  lsa: Expose tracker::non_lsa_used_space()
  lsa: Fix memory leak on zone reclaim

(cherry picked from commit 4ad212dc01)
2018-01-16 15:55:09 +02:00
Takuya ASADA
9007b38002 dist/common/systemd: specify correct repo file path for housekeeping service on Ubuntu/Debian
Currently scylla-housekeeping-daily.service/-restart.service hardcode
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify the CentOS .repo file,
but we use the same .service files for Ubuntu/Debian.
This doesn't work correctly; we need to specify the .list file for Debian variants.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit c2e87f4677)
2017-12-16 22:05:38 +02:00
Glauber Costa
f2e0affcc5 database: delete created SSTables if streaming writes fail
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.

We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same issue.

Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.

Fixes #3062

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
(cherry picked from commit 1aabbc75ab)
2017-12-13 10:26:07 +02:00
Avi Kivity
6fce847000 Update seastar submodule
* seastar f27b240...e23b9b8 (1):
  > rpc: make sure that _write_buf stream is always properly closed

Fixes #3018.
2017-11-26 10:40:23 +02:00
Avi Kivity
f6f91a49cb Update seastar submodule
* seastar 121f468...f27b240 (1):
  > scripts/posix_net_conf.sh: suppress unwanted output from get_irqs_one

Fixes #2808.
2017-10-08 16:40:00 +03:00
Tomasz Grabiec
266a45ad1e Update seastar submodule
* seastar b3ef898...121f468 (1):
  > configure: disable exception scalability hack on debug build
2017-09-25 10:13:59 +02:00
Tomasz Grabiec
7d88026f22 tests: row_cache_test: Fix test failure
Broken after 0ac2c388b6, which assigns
empty reader to _delegate on hitting wide partition limit. The test
assumed that the original _delegate will be invoked when the
single-partition reader is asked for the next partition, which is no
longer the case.

Message-Id: <20170912172739.6851-1-tgrabiec@scylladb.com>
2017-09-12 20:33:10 +03:00
Duarte Nunes
760af5635d tests: Remove sstable_assertions
The test using these assertions has been removed, and the
infrastructure required for them to work is absent from 1.7.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170912113714.24223-1-duarte@scylladb.com>
2017-09-12 14:41:44 +03:00
Duarte Nunes
8c18bfa8d6 sstable_mutation_test: Remove promoted index monotonicity test
The infrastructure this test relies on is not present in 1.7, so
just remove the test as backporting the required changes would be a
risky, non-trivial effort.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170912081304.10116-1-duarte@scylladb.com>
2017-09-12 11:18:05 +03:00
Avi Kivity
04e3785f77 Update seastar submodule
* seastar 688bb6f...b3ef898 (1):
  > build: fix bad merge artifacts
2017-09-11 16:27:52 +03:00
Avi Kivity
e00e6ad1b6 Update seastar submodule
* seastar 949b710...688bb6f (2):
  > build: export full cflags in pkgconfig file
  > build: disable -Wattributes when gcc -fvisibility=hidden bug strikes

Fixes build with gcc 6.4/gcc 7.
2017-09-11 15:41:52 +03:00
Pekka Enberg
5653ea9f8d release: prepare for 1.7.5 2017-09-11 14:03:12 +03:00
Avi Kivity
4dbd1b77cd Merge "Fix Scylla upgrades when counters are used" from Paweł
"A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available Scylla still writes sstables and sends on-wire counters using
the old ordering so that it can be correctly understood by 1.7.4, once
the flag becomes available Scylla switches to the correct order.

Fixes #2752."

* tag 'fix-upgrade-with-counters-1.7/v1' of https://github.com/pdziepak/scylla:
  tests/counter: verify counter_id ordering
  counter: check that utils::UUID uses int64_t
  mutation_partition_serializer: use old counter ordering if necessary
  mutation_partition_view: do not expect counter shards to be sorted
  sstables: write counter shards in the order expected by the cluster
  tests/sstables: add storage_service_for_tests to counter write test
  tests/sstables: add test for reading wrong-order counter cells
  sstables: do not expect counter shards to be sorted
  storage_service: introduce CORRECT_COUNTER_ORDER feature
  tests/counter: test 1.7.4 compatible shard ordering
  counters: add helper for retrieving shards in 1.7.4 order
  tests/counter: add tests for 1.7.4 counter shard order
  counters: add counter id comparator compatible with Scylla 1.7.4
  tests/counter: verify order of counter shards
  tests/counter: add test for sorting and deduplicating shards
  counters: add function for sorting and deduplicating counter cells
  counters: add more comparison operators
2017-09-11 13:27:01 +03:00
Paweł Dziepak
0e61212c20 tests/counter: verify counter_id ordering 2017-09-05 13:49:01 +01:00
Paweł Dziepak
6f4bc82b6e counter: check that utils::UUID uses int64_t 2017-09-05 13:49:01 +01:00
Paweł Dziepak
c1a30d3f60 mutation_partition_serializer: use old counter ordering if necessary
Until the cluster is fully upgraded from a version that uses the
incorrect counter shard ordering it is essential to keep using it lest
the old nodes corrupt the data upon receiving mutations with a counter
shard ordering they do not expect.
2017-09-05 13:49:01 +01:00
Paweł Dziepak
cbad33033f mutation_partition_view: do not expect counter shards to be sorted 2017-09-05 13:49:01 +01:00
Paweł Dziepak
1f31be9ba3 sstables: write counter shards in the order expected by the cluster
If the feature signaling that we have switched to the correct ordering
of counter shards is not enabled it means that the user still can do a
rollback to a version that expects wrong ordering. In order to avoid any
disasters when that happens write sstables using the 1.7.4 order until
we know for sure that it is no longer needed.
2017-09-05 13:49:01 +01:00
Paweł Dziepak
7e89dc3bbf tests/sstables: add storage_service_for_tests to counter write test
Writing counters to an sstable is going to require cluster feature
information, which requires accessing some singletons.
2017-09-05 13:49:01 +01:00
Paweł Dziepak
2cdcaeba6e tests/sstables: add test for reading wrong-order counter cells 2017-09-05 13:49:01 +01:00
Paweł Dziepak
55cb0cafa8 sstables: do not expect counter shards to be sorted 2017-09-05 13:49:01 +01:00
Paweł Dziepak
660572e85c storage_service: introduce CORRECT_COUNTER_ORDER feature
Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix
this problem a new feature is introduced that will be used to determine
when nodes with that bug fixed can start sending counter shard in the
correct order.
2017-09-05 13:49:01 +01:00
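A rough sketch of how a cluster feature can gate the ordering follows. It assumes, for illustration only, that the legacy 1.7.4 order corresponds to a signed comparison and the correct order to an unsigned one; `shard_id` and `order_for_cluster()` are hypothetical names, not Scylla's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a counter shard id, reduced to one 64-bit word.
using shard_id = int64_t;

// Orders shard ids for writing or sending. While the feature is not enabled,
// keep the legacy (assumed signed) order so that older nodes still understand
// the data; once it is enabled, switch to the unsigned order.
std::vector<shard_id> order_for_cluster(std::vector<shard_id> ids,
                                        bool correct_counter_order) {
    if (correct_counter_order) {
        std::sort(ids.begin(), ids.end(), [](shard_id a, shard_id b) {
            return static_cast<uint64_t>(a) < static_cast<uint64_t>(b);
        });
    } else {
        std::sort(ids.begin(), ids.end());
    }
    return ids;
}
```

Because the flag is an input rather than a compile-time choice, the same binary can behave like 1.7.4 during a rolling upgrade and switch over once every node advertises the feature.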
Paweł Dziepak
b86da0c479 tests/counter: test 1.7.4 compatible shard ordering 2017-09-05 13:49:01 +01:00
Paweł Dziepak
b1b8599b1a counters: add helper for retrieving shards in 1.7.4 order 2017-09-05 13:49:00 +01:00
Paweł Dziepak
89c037dfc8 tests/counter: add tests for 1.7.4 counter shard order 2017-09-05 13:49:00 +01:00
Paweł Dziepak
25eec66935 counters: add counter id comparator compatible with Scylla 1.7.4 2017-09-05 13:49:00 +01:00
Paweł Dziepak
b5787ca640 tests/counter: verify order of counter shards 2017-09-05 13:49:00 +01:00
Paweł Dziepak
838dbd98ac tests/counter: add test for sorting and deduplicating shards 2017-09-05 13:49:00 +01:00
Paweł Dziepak
022c2ff53a counters: add function for sorting and deduplicating counter cells
Due to a bug in an implementation of UUID less compare some Scylla
versions sort counter shards in an incorrect order. Moreover, when
dealing with imported correct data the inconsistencies in ordering
caused some counter shards to become duplicated.
2017-09-05 13:49:00 +01:00
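Sorting and deduplicating shards, as 022c2ff53a describes, might look roughly like the sketch below. The `shard` layout and the rule for resolving duplicates (keep the entry with the highest logical clock) are assumptions made for illustration; the commit message does not state which duplicate wins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified counter shard.
struct shard {
    uint64_t id;
    int64_t value;
    int64_t logical_clock;
};

// Sorts shards by id and drops duplicates of the same id.
// Assumption: among duplicates, the shard with the highest logical clock wins.
std::vector<shard> sort_and_dedup(std::vector<shard> shards) {
    std::sort(shards.begin(), shards.end(), [](const shard& a, const shard& b) {
        if (a.id != b.id) {
            return a.id < b.id;
        }
        return a.logical_clock > b.logical_clock;  // newest first within an id
    });
    // std::unique keeps the first element of each run, i.e. the newest shard.
    auto last = std::unique(shards.begin(), shards.end(),
                            [](const shard& a, const shard& b) { return a.id == b.id; });
    shards.erase(last, shards.end());
    return shards;
}
```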
Paweł Dziepak
b7c27d73d8 counters: add more comparison operators 2017-09-05 13:49:00 +01:00
Vlad Zolotarov
bdc0ca7064 service::storage_service: initialize auth and tracing after we joined the ring
Initialize the system_auth and system_traces keyspaces and their tables after
the node joins the token ring, because system_auth initialization is going
to issue SELECT and possibly INSERT CQL statements.

This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.

Fixes #2273

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit e98adb13d5)
2017-08-30 09:33:33 +02:00
Calle Wilund
34260ce471 utils::UUID: operator< should behave as comparison of hex strings/bytes
I.e. need to be unsigned comparison.
Message-Id: <1487683665-23426-1-git-send-email-calle@scylladb.com>

(cherry picked from commit 0d87f3dd7d)
2017-08-24 14:18:55 +01:00
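The difference between signed and unsigned comparison in 34260ce471 can be shown with a small sketch. The `uuid` struct below is a simplified, hypothetical model with two signed 64-bit halves; it is not the real `utils::UUID` class.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical simplified UUID: most and least significant 64 bits, stored signed.
struct uuid {
    int64_t msb;
    int64_t lsb;
};

// Unsigned comparison: matches the ordering of the hex string / raw bytes.
inline bool uuid_less(const uuid& a, const uuid& b) {
    return std::make_tuple(static_cast<uint64_t>(a.msb), static_cast<uint64_t>(a.lsb))
         < std::make_tuple(static_cast<uint64_t>(b.msb), static_cast<uint64_t>(b.lsb));
}

// Signed comparison, shown only for contrast: a value whose top bit is set
// sorts before small positive values.
inline bool uuid_less_signed(const uuid& a, const uuid& b) {
    return std::make_tuple(a.msb, a.lsb) < std::make_tuple(b.msb, b.lsb);
}
```

For a UUID whose first byte is `0xff` and one whose first byte is `0x00`, the two functions disagree on which comes first, which is the kind of mismatch the fix addresses.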
Avi Kivity
cffe57bcc7 Merge "repair: Do not allow repair until node is in NORMAL status" from Asias
Fixes #2723.

* tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev:
  repair: Do not allow repair until node is in NORMAL status
  gossip: Add is_normal helper

(cherry picked from commit 2f41ed8493)
2017-08-23 09:45:54 +03:00
Paweł Dziepak
adb9ce7f38 lsa: avoid unnecessary segment migrations during reclaim
segment_zone::migrate_all_segments() was trying to migrate all segments
inside a zone to the other one hoping that the original one could be
completely freed. This was an attempt to optimise for throughput.

However, this may unnecessarily hurt latency if the zone is large, but
only a few segments are required to satisfy reclaimer's demands.
Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>

(cherry picked from commit 0318dccafd)
2017-08-22 09:29:05 +02:00
Tomasz Grabiec
5f1fd7a0b1 schema_registry: Ensure schema_ptr is always synced on the other core
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.

The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.

Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 65c64614aa)
2017-08-17 17:15:12 +02:00
Avi Kivity
d1f06633e0 Update seastar submodule
* seastar a4d924e...949b710 (1):
  > fstream: do not ignore unresolved future

Fixes #2697.
2017-08-16 15:12:45 +03:00
Avi Kivity
b54ea3f6cf dist: use correct repository for third-party RPMs 2017-08-16 11:24:42 +03:00
Avi Kivity
63fd65414a Update seastar submodule
* seastar e5825b5...a4d924e (1):
  > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb

Fixes #2690
2017-08-14 16:25:03 +03:00
Avi Kivity
9790c2d229 Update seastar submodule
* seastar 8d9fd92...e5825b5 (1):
  > tls: Only recurse once in shutdown code

Fixes #2691
2017-08-14 15:12:01 +03:00
Raphael S. Carvalho
7728a8dec5 sstables: close index file when sstable writer fails
index's file output stream uses write behind but it's not closed
when sstable write fails and that may lead to crash.
It happened before for data file (which is obviously easier to
reproduce for it) and was fixed by 0977f4fdf8.

Fixes #2673.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
(cherry picked from commit dddbd34b52)
2017-08-08 09:59:10 +03:00
Duarte Nunes
1fd4a3ed34 tests/sstable_mutation_test: Don't use moved-from object
Fix a bug introduced in dbbb9e93d and exposed by gcc6 by not using a
moved-from object. Twice.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170802161033.4213-1-duarte@scylladb.com>
(cherry picked from commit 4c9206ba2f)
2017-08-03 09:46:33 +03:00
Avi Kivity
0b48863a7e Merge "Ensure correct EOC for PI block cell names" from Duarte
"This series ensures that we always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.

Fixes #2333"

* 'pi-cell-name/v1' of github.com:duarten/scylla:
  tests/sstable_mutation_test: Test promoted index blocks are monotonic
  sstables: Consider eoc when flushing pi block
  sstables: Extract out converting bound_kind to eoc

(cherry picked from commit db7329b1cb)
2017-08-01 18:13:19 +03:00
Gleb Natapov
aec94b926c cql transport: run accept loop in the foreground
It was meant to be run in the foreground since it is waited upon during
stop(), but as it is now from the stop() perspective it is completed
after first connection is accepted.

Fixes #2652

Message-Id: <20170801125558.GS20001@scylladb.com>
(cherry picked from commit 1da4d5c5ee)
2017-08-01 17:07:55 +03:00
Tomasz Grabiec
0ac2c388b6 row_cache: Avoid deadlock/timeout due to sstable read concurrency limit
database::make_sstable_reader() creates a reader which will need to
obtain a semaphore permit when invoked, so that there is a limit on
sstable read concurrency (edeef03). Therefore, each read may create at
most one such reader in order to be guaranteed to make
progress. Otherwise, the creation of the second reader may deadlock
(in case of system tables) or timeout (non-system tables), if enough
number of such readers tries to do the same thing at the same time.

One instance of the problem fixed by this patch is in cache populating
reader (98c12dc) when we reach partition size limit
(max_cached_partition_size_in_kb). In that case population is
abandoned and a second read is created, while still keeping the old
one alive. We saw this causing deadlocks during schema tables parsing
when system.schema_columns contained large partitions. Fixes #2623.

Another case when this can potentially happen is when populating
readers are recreated by cache. We replace the reader there, but using
assignment, so the old reader is still alive when the new one is
created. This patch fixes two out of three of such cases. The third
one (in a scanning read) is not that easy to fix. That problem doesn't
exist in version 2.0 and master, where the cache is reworked for row
granularity.

Refs #2644.

Message-Id: <1501160300-18097-1-git-send-email-tgrabiec@scylladb.com>
2017-08-01 12:10:39 +03:00
Takuya ASADA
09ac5b57aa dist/redhat: limit metapackage dependencies to specific version of scylla packages
When we install the scylla metapackage with a version (ex: scylla-1.7.1),
it always installs the newest scylla-server/-jmx/-tools in the repo,
instead of installing the specified version of the packages.

To install packages of the same version as the metapackage, limit dependencies to
the current package version.

Fixes #2642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170726193321.7399-1-syuu@scylladb.com>
(cherry picked from commit 91a75f141b)
2017-07-27 14:22:06 +03:00
Shlomi Livne
ff643e3e40 release: prepare for 1.7.4
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-07-26 17:26:33 +03:00
Asias He
a7b8d89de8 gossip: Fix nr_live_nodes calculation
We need to consider the _live_endpoints size. The nr_live_nodes should
not be larger than _live_endpoints size, otherwise the loop to collect
the live node can run forever.

It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).

Fixes #2637

Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
(cherry picked from commit 515a744303)
2017-07-26 16:49:11 +03:00
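The loop hazard fixed in a7b8d89de8 can be sketched as follows. `pick_live_nodes()` is a hypothetical helper, not the gossiper's code; it only shows why the requested count has to be clamped to the number of live endpoints before collecting distinct nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <string>
#include <vector>

// Picks up to `wanted` distinct live nodes at random.
// Without the std::min clamp, asking for more nodes than exist would make the
// loop spin forever, because the set can never grow past live.size().
std::set<std::string> pick_live_nodes(const std::vector<std::string>& live,
                                      std::size_t wanted, std::mt19937& rng) {
    std::set<std::string> picked;
    if (live.empty()) {
        return picked;
    }
    const std::size_t nr_live_nodes = std::min(wanted, live.size());
    std::uniform_int_distribution<std::size_t> dist(0, live.size() - 1);
    while (picked.size() < nr_live_nodes) {
        picked.insert(live[dist(rng)]);
    }
    return picked;
}
```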
Duarte Nunes
013fa3da14 schema: Calculate default validator
Fixes #2605

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170719105131.21455-3-duarte@scylladb.com>
2017-07-20 10:58:29 +02:00
Duarte Nunes
259cfaf8f9 thrift: Set default validator for static CFs
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170719105131.21455-2-duarte@scylladb.com>
2017-07-20 10:58:29 +02:00
Duarte Nunes
6501bf8e54 schema_tables: Recover comparator type
Fixes #2573

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718125450.3727-1-duarte@scylladb.com>
2017-07-19 10:58:43 +02:00
Pekka Enberg
41b4055911 release: prepare for 1.7.3 2017-07-18 17:34:46 +03:00
Nadav Har'El
b594f21f91 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170718110643.8667-1-nyh@scylladb.com>
2017-07-18 16:54:11 +03:00
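The choice of last_end described in b594f21f91 reduces to a small decision, sketched below. The `forwarding` enum and `choose_last_end()` are hypothetical simplifications; the real patch uses the mutation_reader::forwarding flag threaded through several reader layers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the mutation_reader::forwarding flag.
enum class forwarding { no, yes };

// Returns the byte position up to which the disk stream is opened.
// If fast_forward_to() may later move to another partition range, the stream
// must reach the end of the file; otherwise it can stop at the range's end,
// so no read-ahead happens past the requested bytes.
inline uint64_t choose_last_end(forwarding fwd, uint64_t range_end, uint64_t file_end) {
    assert(range_end <= file_end);
    return fwd == forwarding::yes ? file_end : range_end;
}
```

With `forwarding::no`, as the message says the repair reader uses, a read of a small range in a large file opens the stream only up to that range's end.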
Avi Kivity
bcd2e6249f dist: tolerate sysctl failures
sysctl may fail in a container environment if /proc is not virtualized
properly.

Fixes #1990
Message-Id: <20170625145930.31619-1-avi@scylladb.com>

(cherry picked from commit 08488a75e0)
2017-07-18 15:47:10 +03:00
Takuya ASADA
4c79add7b0 dist/debian: skip tunables when kernel = 3.13.0-*-generic, to prevent kernel panic bug
There is a kernel panic bug on kernel = 3.13.0-*-generic (Ubuntu 14.04), so we have to skip tunables.

Fixes #1724

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493196636-25645-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit abf65cb485)
2017-07-18 15:47:03 +03:00
Asias He
00f6ccb75d gossip: Implement the missing fd_max_interval_ms and fd_initial_value_ms option
It is useful for larger cluster with larger gossip message latency. By
default the fd_max_interval_ms is 2 seconds which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger cluster, the gossip message update
interval can be larger than 2 seconds.

Fixes #2603.

Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
(cherry picked from commit adc5f0bd21)
2017-07-17 13:30:34 +03:00
Avi Kivity
77ac5a63db Update seastar submodule
* seastar fc69677...8d9fd92 (1):
  > rpc: start server's send loop only after protocol negotiation

Fixes #2600.
2017-07-17 10:43:12 +03:00
Pekka Enberg
eb9de1a807 Merge "Repair backport for 1.7 branch" from Asias
"This series backports all the repair related fixes to enterprise branch and
 updates the scylla_repair to send ranges to repair to all the shards in
 parallel, independently.

 With this series, repair can utilize all the CPUs and is much more efficient."

* tag 'asias/repair-backport-branch-1.7.3-v1' of github.com:cloudius-systems/seastar-dev:
  repair: Use selective_token_range_sharder
  tests: Add test_selective_token_range_sharder
  dht: Add selective_token_range_sharder
  repair: further limit parallelism of checksum calculation
  repair: Do not store the failed ranges
  repair: Prefer nodes in local dc when streaming
  repair: Repair on all shards
  repair: Allow one stream plan in flight
2017-07-14 13:02:26 +03:00
Duarte Nunes
643a777067 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
(cherry picked from commit b8235f2e88)
2017-07-14 12:12:09 +03:00
Avi Kivity
6f91939650 Update seastar submodule
* seastar 8e2f629...fc69677 (1):
  > tls: Wrap all IO in semaphore (Fixes #2575)
2017-07-12 10:24:04 +03:00
Gleb Natapov
15da71266d consistency_level: report less live endpoints in Unavailable exception if there are pending nodes
DowngradingConsistencyRetryPolicy uses live replicas count from
Unavailable exception to adjust CL for retry, but when there are pending
nodes CL is increased internally by a coordinator and that may prevent
retried query from succeeding. Adjust live replica count in case of
pending node presence so that retried query will be able to proceed.

Fixes #2535

Message-Id: <20170710085238.GY2324@scylladb.com>
(cherry picked from commit 739dd878e3)
2017-07-11 17:16:58 +03:00
Botond Dénes
9cd36ade00 Fix crash in the out-of order restrictions error msg composition
Use name of the existing preceding column with restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This will yield an error message that is different from that of
Cassandra, albeit still a correct one.

Fixes #2421

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
(cherry picked from commit 33bc62a9cf)
2017-07-11 17:16:01 +03:00
Asias He
6f58a1372e repair: Use selective_token_range_sharder
With this change, we ask all the shards to handle the ranges provided by
the user, and we use selective_token_range_sharder to split the ranges and
ignore the ranges that do not belong to the current shard.

(cherry picked from commit b10e961a64)

 Conflicts:
	repair/repair.cc
2017-07-11 08:40:49 +08:00
Asias He
0a9d26de4a tests: Add test_selective_token_range_sharder
(cherry picked from commit 2a794db61b)
2017-07-11 08:40:49 +08:00
Asias He
35cd63e1f7 dht: Add selective_token_range_sharder
It is like ring_position_range_sharder but it works with
dht::token_range. This sharder will return the ranges that belong to a
selected shard.

(cherry picked from commit d835cf2748)
2017-07-11 08:40:49 +08:00
Nadav Har'El
2ada799e07 repair: further limit parallelism of checksum calculation
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.

But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.

So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparisons (sending the checksum requests to other
nodes and waiting for them to return), but only 2 will ever be in the stage of
reading from disk and checksumming them.

The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.

Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
(cherry picked from commit d177ec05cb)
2017-07-11 08:40:49 +08:00
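The two-level limit from 2ada799e07 can be modelled with a pair of counting limiters. This is a non-blocking toy model under stated assumptions, not Seastar's semaphore: `limiter` and `repair_limits` are hypothetical names, and only the numbers 100 and 2 come from the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Minimal non-blocking counting limiter, standing in for a real semaphore.
class limiter {
    std::size_t _free;
public:
    explicit limiter(std::size_t units) : _free(units) {}
    // Takes one unit if available; returns false instead of waiting.
    bool try_acquire() {
        if (_free == 0) {
            return false;
        }
        --_free;
        return true;
    }
    void release() { ++_free; }
    std::size_t available() const { return _free; }
};

// Per-shard limits: a comparison holds a `comparisons` unit for its whole
// lifetime (including RPC waits), but holds a `calculations` unit only while
// it reads from disk and computes the checksum.
struct repair_limits {
    limiter comparisons{100};
    limiter calculations{2};
};
```

Keeping the outer limit high hides network latency, while the inner limit bounds how many large read buffers exist at once.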
Asias He
b71037ac55 repair: Do not store the failed ranges
The number of failed ranges can be large so it can consume a lot of memory.
We already logged the failed ranges in the log. No need to store them
in memory.

Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>
(cherry picked from commit b2a2fbcf73)

 Conflicts:
	repair/repair.cc
2017-07-11 08:40:49 +08:00
Asias He
8639f32efd repair: Prefer nodes in local dc when streaming
When peer nodes have the same partition data, i.e., with the same
checksum, we currently choose to stream from any of them randomly.
To improve streaming performance, select the peer within the same DC.
This patch is supposed to improve repair performance with multiple DCs.

Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>
(cherry picked from commit cc02a62756)
2017-07-11 08:40:48 +08:00
Asias He
a0dce7c922 repair: Repair on all shards
Currently, shard zero is the coordinator of the repair. All the work of
checksumming of the local node and sending of the repair checksum rpc
verb is done on shard zero only. This causes other shards to be
underutilized.

With this patch, we split the ranges that need to be repaired into at least
smp::count ranges, so sizeof(ranges) / smp::count will be assigned to
each shard. For example, if we have 8 shards and 256 ranges, each shard
will repair 32 ranges. Each shard will repair its 32 ranges
sequentially. There will be at most 8 (smp::count) ranges of repair in
parallel.

(cherry picked from commit 47345078ec)

Conflicts:
	repair/repair.cc
2017-07-11 08:40:48 +08:00
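The arithmetic in a0dce7c922 (256 ranges over 8 shards gives 32 ranges per shard) can be sketched as a contiguous split. `split_among_shards()` is a hypothetical helper that uses plain integers in place of token ranges; how the real code distributes a remainder is not stated in the message, so spreading it over the first shards is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `ranges` into `shard_count` contiguous groups of nearly equal size.
// Assumption: any remainder is spread over the first shards, one range each.
std::vector<std::vector<int>> split_among_shards(const std::vector<int>& ranges,
                                                 std::size_t shard_count) {
    assert(shard_count > 0);
    std::vector<std::vector<int>> per_shard(shard_count);
    const std::size_t base = ranges.size() / shard_count;
    const std::size_t extra = ranges.size() % shard_count;
    auto it = ranges.begin();
    for (std::size_t s = 0; s < shard_count; ++s) {
        const auto n = static_cast<std::ptrdiff_t>(base + (s < extra ? 1 : 0));
        per_shard[s].assign(it, it + n);
        it += n;
    }
    return per_shard;
}
```

Each shard then walks its own group sequentially, so at most `shard_count` ranges are being repaired at any one time.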
Asias He
d39ff4f2ac repair: Allow one stream plan in flight
In "repair: Use more stream_plan" (commit 2043ffc064), we
switched to streaming while doing checksum instead of streaming only
after the checksum phase is completed. We take a parallelism_semaphore
before we do checksum, if there are more than sub_ranges_to_stream
(1024) ranges, we start a stream_plan and wait for the streaming to
complete (still under the parallelism_semaphore). So at most
parallelism_semaphore (100) stream_plans can be in parallel.

The parallelism_semaphore limits the parallelism of both checksum and the
streaming plan. However, it is not necessary to have the same
parallelism for both checksum and streaming, because 1) a streaming
operation itself runs in parallel (handling ranges on all shards in
parallel, sending mutations in parallel), and 2) having more streaming plans
(in the worst case 100) means we can write to 100 memtables at the same time
and flush 100 memtables to disk at the same time, which can take a lot of
memory.

With this patch, we only allow one stream plan in flight.

(cherry picked from commit 54831a344c)
2017-07-11 08:40:48 +08:00
Avi Kivity
7cbfe0711f dist: redirect stdout/stderr to the journal on systemd systems
Fixes #2408.

Message-Id: <20170524080729.10085-1-avi@scylladb.com>
(cherry picked from commit 15af6acc8b)
2017-07-10 19:31:14 +03:00
Glauber Costa
139a2d14a1 disable defragment-memory-on-idle-by-default
It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634, and also recently
I have investigated continuous performance degradation that was also
linked to defrag on idle activity.

Until we can figure out how to reduce its impact, we should disable it.

Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
(cherry picked from commit f3742d1e38)
2017-07-10 19:25:12 +03:00
Asias He
6fff331698 gossip: Use vector for _live_endpoints
To speed up the random access in get_random_node, switch to using a
vector instead of a set.
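
A minimal sketch of why the container matters for random access. The
helper names are hypothetical and this is not the gossiper code itself; it
only contrasts reaching the n-th element of a std::set with indexing a
std::vector.

```cpp
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Reaching the idx-th element of a std::set means stepping through the
// tree one node at a time, so the cost grows linearly with idx.
template <typename T>
const T& nth_in_set(const std::set<T>& s, std::size_t idx) {
    auto it = s.begin();
    std::advance(it, static_cast<std::ptrdiff_t>(idx));
    return *it;
}

// The same lookup on a std::vector is a constant-time index operation.
template <typename T>
const T& nth_in_vector(const std::vector<T>& v, std::size_t idx) {
    return v[idx];
}
```

Both helpers assume idx is smaller than the container size.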

(cherry picked from commit e31d4a3940)
Message-Id: <fea90eaa5273fac50d0013b3778d9a4f2562e0b7.1499394330.git.asias@scylladb.com>
2017-07-10 14:42:26 +03:00
Asias He
43ae64cd47 gossip: Talk to more live nodes in each gossip round
In large clusters with multi-DC deployments, it was observed that gossip
updates take a long time to disseminate through the cluster.

To speed this up, talk to more live nodes in each gossip round.

Fixes #2528

(cherry picked from commit 437899909d)
Message-Id: <9bcdaf1fb5637d14a7fda9188ba76ced8f1afaaf.1499394330.git.asias@scylladb.com>
2017-07-10 14:40:40 +03:00
Tomasz Grabiec
f306b47a88 tests: commitlog: Check there are no segments left on disk after clean shutdown
Reproduces #2550.

Message-Id: <1499358825-17855-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 72e01b7fe8)
2017-07-10 12:41:33 +03:00
Tomasz Grabiec
47b1e39410 commitlog: Discard active but unused segments on shutdown
So that they are not left on disk even though we did a clean shutdown.

First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose positions are cleared can be
removed.

Fixes #2550.

Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 6555a2f50b)
2017-07-10 12:40:43 +03:00
Botond Dénes
0f4d5cde8e cql3: Add K_FROZEN and K_TUPLE to basic_unreserved_keyword
To allow the non-reserved keywords "frozen" and "tuple" to be used as
column names without double-quotes.

Fixes #2507

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9ae17390662aca90c14ae695c9b4a39531c6cde6.1499329781.git.bdenes@scylladb.com>
(cherry picked from commit c4277d6774)
2017-07-06 18:19:59 +03:00
Avi Kivity
a24dcf1a19 Update seastar submodule
* seastar 18a82e2...8e2f629 (1):
  > future-utils: fix do_for_each exception reporting

Fixes bug during a failed repair.
2017-07-06 17:32:37 +03:00
Raphael S. Carvalho
611c25234e database: fix potential use-after-free in sstable cleanup
When do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, the sstable object will be used after
being freed, because it was taken by reference and the container it lives
in was destroyed prematurely.

Let's fix it with a do_with, also making code nicer.

Fixes #2537.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
(cherry picked from commit b9d0645199)
2017-07-03 12:49:34 +03:00
Amos Kong
f64e3e24d4 common/scripts: fix node_exporter url
Commit ff3d83bc2f updated node_exporter
from 0.12.0 to 0.14.0, which broke downloading of the install file.

node_exporter started to add a 'v' prefix to release tags[1] from 0.13.0,
so we need to fix the url.

[1] https://github.com/prometheus/node_exporter/tags

Fixes #2509

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <42b0a7612539a34034896d404d63a0a31ce79e10.1497919368.git.amos@scylladb.com>
(cherry picked from commit 92731eff4f)
2017-06-22 08:51:35 +03:00
Shlomi Livne
f6034c717d release: prepare 1.7.2
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-06-21 22:09:31 +03:00
Amos Kong
b6f4df3cc8 scylla_setup: fix deadloop in inputting invalid option
example: # scylla_setup --invalid-opt

Fixes #2305

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <9a4f631b126d8eaaae479fa99137db7a61a7c869.1493135357.git.amos@scylladb.com>
(cherry picked from commit f655639e5a)
2017-06-19 22:32:38 +03:00
Amnon Heiman
af028360d7 node_exporter_install script update version to 0.14
Fixes #2097

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170612125724.7287-1-amnon@scylladb.com>
(cherry picked from commit ff3d83bc2f)
2017-06-18 12:28:19 +03:00
Duarte Nunes
60af7eab10 udt: Don't check a type is unused after applying the schema mutations
This patch is based on 6c8b5fc. It moves the check whether a dropped
type is still used by other types or tables from schema_tables to
the drop_type_statement, as delaying this check to after applying the
mutations can leave the keyspace in a broken state.

Fixes #2490

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497466736-28841-1-git-send-email-duarte@scylladb.com>
2017-06-15 10:35:01 +03:00
Calle Wilund
665d14584c database: Fix assert in truncate to handle empty memtables+sstables
If we do two truncates in a row, the second will have neither memtable
nor sstable data. Thus we will not write/remove sstables, and thus
get no resulting truncation replay position.

Fixes #2489

Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>

(cherry picked from commit 525730e135)
2017-06-14 16:25:57 +03:00
Gleb Natapov
bb56e7682c Fix use after free in nonwrapping_range::intersection
end_bound() returns a temporary object (end_bound_ref), so it cannot be
taken by reference here and used later. Copy instead.
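
A minimal sketch of this class of bug and of the "copy instead" fix. All
names below are hypothetical stand-ins, not the code in range.hh; the
sketch assumes the reference was obtained through a helper that returns
one of its arguments by reference.

```cpp
// Hypothetical stand-in for end_bound_ref.
struct bound_view {
    int value;
    bool inclusive;
};

// Hypothetical stand-in for a range type.
struct simple_range {
    int end;
    bool end_inclusive;
    // Returns a temporary, like end_bound() in the commit message.
    bound_view end_bound() const { return {end, end_inclusive}; }
};

// Returns a reference to one of its arguments. If the arguments are
// temporaries, the result must not outlive the calling statement.
inline const bound_view& min_bound(const bound_view& a, const bound_view& b) {
    return b.value < a.value ? b : a;
}

inline bound_view smaller_end(const simple_range& x, const simple_range& y) {
    // Unsafe variant (left as a comment on purpose):
    //   const bound_view& m = min_bound(x.end_bound(), y.end_bound());
    // Both temporaries die at the end of that statement, so m would dangle.
    bound_view m = min_bound(x.end_bound(), y.end_bound()); // copied while the temporaries are alive
    return m;
}
```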

Message-Id: <20170612132328.GJ21915@scylladb.com>

(cherry picked from commit 21197981a)

Fixes #2482
2017-06-14 12:08:06 +01:00
Avi Kivity
a4bd56ce40 tests: fix partitioner_test build on gcc 5 2017-06-13 21:56:02 +03:00
Calle Wilund
6340fe61af commitlog_test: Fix test_commitlog_delete_when_over_disk_limit
Test should
a.) Wait for the flush semaphore
b.) Only compare segment sets between start and end, not start,
    end and in between. I.e. the test sort of assumed we started
    with < 2 (or so) segments. Not always the case (timing)

Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 0c598e5645)
2017-06-13 19:53:13 +03:00
Asias He
f2317a6f3f repair: Fix range use after free
Capture it by value.

scylla:  [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
scylla:  [shard 0] repair - Failed sync of range ==<runtime_exception
(runtime error: Invalid token. Should have size 8, has size 0#012)>: streaming::stream_exception (Stream failed)

Message-Id: <7fda4432e54365f64b556e7e4c26e36d3a9bb1b7.1497238229.git.asias@scylladb.com>
(cherry picked from commit 2bcb368a13)
2017-06-13 11:03:14 +03:00
Paweł Dziepak
7bb41b50f9 commitlog: avoid copying column_mapping
It is safe to copy column_mapping across shards. Such a guarantee comes at
the cost of performance.

This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.

Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
          before       after      diff
write   79434.35    89247.54    +12.3%

(cherry picked from commit 374c8a56ac)

Also: Fixes #2468.
2017-06-11 15:44:20 +03:00
Paweł Dziepak
57d602fdd6 idl: fix generated writers when member functions are used
When using a member name in an identifier of a generated class or method,
the idl compiler should strip the trailing '()'.

(cherry picked from commit 4df4994b71)

(part of #2468)
2017-06-11 15:43:53 +03:00
Paweł Dziepak
cd14b83192 idl: add start_frame() overload for seastar::simple_output_stream
(cherry picked from commit 018d16d315)

(part of #2468)
2017-06-11 15:43:11 +03:00
Avi Kivity
a85b70d846 Merge "repair memory usage fix" from Asias
"This series switches repair to use more stream plans to stream the mismatched
sub ranges and use a range generator to produce sub ranges.

Tests show that repair no longer uses a huge amount of memory with a large data set.

In addition, we now report progress in the log, showing how many ranges have been processed.

   Jun 06 14:18:22  [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
   Jun 06 14:19:55  [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]

Fixes #2430."

* tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev:
  repair: Remove unused sub_ranges_max
  repair: Reduce parallelism in repair_ranges
  repair: Tweak the log a bit
  repair: Use more stream_plan
  repair: iterator over subranges instead of list

(cherry picked from commit 419ad9d6cb)
2017-06-08 14:52:28 +03:00
Avi Kivity
f44ea5335b Update seastar submodule
* seastar 812e232...18a82e2 (1):
  > scripts: posix_net_conf.sh: fix bash syntax causing a failure during bonding iface configuration

Fixes #2269
2017-06-07 18:23:02 +03:00
Pekka Enberg
a95c045b48 Merge "Fixes to thrift/server" from Duarte
"This series fixes some issues with the thrift_server, namely
ensuring that streams and sockets are properly closed.

Fixes #499
Fixes #2437"

* 'thrift-server-fixes/v1' of github.com:duarten/scylla:
  thrift/server: Close connections when stopping server
  thrift/server: Move connection class to header
  thrift/server: Shutdown connection
  thrift/server: Close output_stream when connection is done

(cherry picked from commit a6dc21615b)
2017-06-07 16:08:28 +03:00
Avi Kivity
eb396d2795 Update seastar submodule
* seastar 328fdbc...812e232 (1):
  > rpc: handle messages larger than memory limit

Fixes #2453.
2017-06-07 12:29:59 +03:00
Takuya ASADA
dbbf99d7fa dist/debian: install gdebi when it's not exist
Since we started to use gdebi to install the build-dep metapackage generated by
mk-build-deps, we need to install gdebi in build_deb.sh too.

Fixes #2451

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496819209-30318-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 7fe63c539a)
2017-06-07 10:25:02 +03:00
Raphael S. Carvalho
f7a143e7be sstables: fix report of disk space used by bloom filter
After the change in boot, read_filter is called by the distributed loader,
so its update to _filter_file_size is lost. It is the load variant
which receives foreign components that must do it. We were also
not updating it for newly created sstables.

Fixes #2449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170606151129.5477-1-raphaelsc@scylladb.com>
(cherry picked from commit 0ca1e5cca3)
2017-06-06 19:00:00 +03:00
Takuya ASADA
562102cc76 dist/debian: use gdebi instead of mk-build-deps -i
At least on Debian8, mk-build-deps -i silently finishes with return code 0
even if it fails to install dependencies.
To prevent this, we should manually install the metapackage generated by
mk-build-deps using gdebi.

Fixes #2445

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-2-git-send-email-syuu@scylladb.com>
(cherry picked from commit a4c392c113)
2017-06-06 14:18:14 +03:00
Takuya ASADA
d4b444418a dist/debian/dep: install texlive from jessie-backports to prevent gdb build fail on jessie
Installing openjdk-8-jre-headless from jessie-backports breaks texlive on
jessie main repo.
It causes 'Unmet build dependencies' error when building gdb package.
To prevent this, force installing texlive from jessie-backports before starting
to build gdb.

Fixes #2444

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 5608842e96)
2017-06-06 14:18:08 +03:00
Raphael S. Carvalho
befd4c9819 db: fix computation of live disk usage stat after compaction
rebuild_statistics() uses sstable::data_size(), which only returns the
uncompressed data size, whereas the function it calls expects the
actual disk space used by all components.
Boot uses add_sstable(), which correctly updates the stat with
sstable::bytes_on_disk(). That's what rebuild_statistics() needs to
use too.

Fixes #1592

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
(cherry picked from commit 3b5ad23532)
2017-05-28 10:39:14 +03:00
Avi Kivity
eb2fe0fbd3 Merge "reduce memory requirement for loading sstables" from Raphael
"fixes a problem in which memory requirement for loading in-memory
components of sstables is very high due to unlimited parallelism."

* 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla:
  database: fix indentation of distributed_loader::open_sstable
  database: reduce memory requirement to load sstables
  sstables: loads components for a sstable in parallel
  sstables: enable read ahead for read of in-memory components
  sstables: make random_access_reader work with read ahead

(cherry picked from commit ef428d008c)
2017-05-25 12:59:55 +03:00
Raphael S. Carvalho
eb6b0b1267 db: remove partial sstable created by memtable flush which failed
partial sstable files aren't being removed after each failed attempt
to flush memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, column family may be left with a huge amount of partial
files which will overwhelm subsequent boot when removing temporary
TOC. In the past, it led to OOM because removal of temporary TOC
took place in parallel.

Fixes #2407.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
(cherry picked from commit b7e1575ad4)
2017-05-25 11:50:17 +03:00
Asias He
7836600ded streaming: Do not abort session too early in idle detection
Streaming usually takes a long time to complete. Aborting it on a false
positive idle detection can be very wasteful.

Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. A truly idle session will still be aborted eventually if other
mechanisms (e.g., the streaming manager's gossip callbacks for on_remove
and on_restart events) do not abort the session first.

Fixes #2197

Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
(cherry picked from commit f792c78c96)
2017-05-24 12:30:47 +03:00
Shlomi Livne
230c33da49 release: prepare for 1.7.1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-05-23 22:42:52 +03:00
Raphael S. Carvalho
17d8a0c727 compaction: do not write expired cell as dead cell if it can be purged right away
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because an expired cell is *unconditionally* converted into a
dead cell. Why not check instead whether the expired cell can be purged,
using gc_before and the max purgeable timestamp?

Currently, we need two compactions to get rid of a fully expired sstable
whose cells could have been purged all along.

look at this sstable with expired cell:
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 120,
        "liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
        "cells" : [
          { "name" : "country", "value" : "1" },
        ]

now this sstable data after first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.

  {
    ...
    "rows" : [
      {
        "type" : "row",
        "position" : 79,
        "cells" : [
          { "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
            "tstamp" : "2017-04-09T17:07:12.702597Z"
          },
        ]

now another compaction will actually get rid of data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0

NOTE:
It's a waste of time to wait for second compaction because the expired
cell could have been purged at first compaction because it satisfied
gc_before and max purgeable timestamp.
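
The check argued for above could look roughly like this. It is a
simplified sketch, not Scylla's code: the struct, enum and function names
are hypothetical, and it assumes that "can be purged" means the cell's
deletion time is older than gc_before and its write timestamp is below
the max purgeable timestamp.

```cpp
#include <cstdint>

// Hypothetical, simplified model of an expired cell.
struct expired_cell {
    std::int64_t write_timestamp; // timestamp of the original write
    std::int64_t deletion_time;   // time from which the cell counts as deleted
};

enum class cell_fate { purge, write_as_dead_cell };

// Decide during compaction whether an expired cell can be dropped right
// away or has to be kept as a dead cell (tombstone) for now.
inline cell_fate compact_expired_cell(const expired_cell& c,
                                      std::int64_t gc_before,
                                      std::int64_t max_purgeable_timestamp) {
    const bool purgeable = c.deletion_time < gc_before
                        && c.write_timestamp < max_purgeable_timestamp;
    return purgeable ? cell_fate::purge : cell_fate::write_as_dead_cell;
}
```

When both conditions hold, the cell disappears in the first compaction
instead of surviving as a dead cell until a second one.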

Fixes #2249, #2253

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
(cherry picked from commit a6f8f4fe24)
2017-05-23 20:57:54 +03:00
Tomasz Grabiec
064de6f8de row_cache: Fix undefined behavior in read_wide()
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.

Fixes #2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0351ab8bc6)
2017-05-21 19:09:03 +03:00
Gleb Natapov
df56c108b7 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Parallel
removing of a lot of sstables may start a lot of threads each of which
is taking 128k for its stack. There is not much benefit in running
deletion in parallel anyway, so fix it by deleting sstables sequentially.
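
A back-of-the-envelope sketch of the stack memory involved. The 128k
stack size comes from the message above; the function names are
hypothetical, and the sequential case assumes only one deletion thread is
alive at any time.

```cpp
#include <cstddef>

// Stack size of one thread, as stated in the commit message.
constexpr std::size_t thread_stack_bytes = 128 * 1024;

// Parallel removal: one thread, and therefore one stack, per sstable.
constexpr std::size_t stack_bytes_parallel(std::size_t sstable_count) {
    return sstable_count * thread_stack_bytes;
}

// Sequential removal: at most one thread is alive at a time (assumption).
constexpr std::size_t stack_bytes_sequential(std::size_t sstable_count) {
    return sstable_count == 0 ? 0 : thread_stack_bytes;
}
```

For an illustrative 10,000 temporary sstables, parallel removal needs
about 1.3 GB of stack space, while sequential removal stays at 128k.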

Fixes #2384

Message-Id: <20170516103018.GQ3874@scylladb.com>
(cherry picked from commit c7ad3b9959)
2017-05-21 18:56:22 +03:00
Tomasz Grabiec
25607ab9df range: Fix SFINAE rule for picking the best do_lower_bound()/do_upper_bound() overload
mutation_partition has a slicing constructor which is supposed to copy
only the rows from the query range. The rows are located using
nonwrapping_range::lower_bound() and
nonwrapping_range::upper_bound(). Those two have two different
implementations chosen with SFINAE. One is using std::lower_bound(),
and one is using container's built in lower_bound() should it
exist. We're using intrusive tree in mutation_partition, so
container's lower_bound() is preferred. It's O(log N) whereas
std::lower_bound() is O(N), because tree's iterator is not random
access.

However, the current rule for picking container's lower_bound() never
triggers, because lower_bound() has two overloads in the container:

  ./range.hh:618:14: error: decltype cannot resolve address of overloaded function
              typename = decltype(&std::remove_reference<Range>::type::upper_bound)>
              ^~~~~~~~

As a result, the overload which uses std::lower_bound() is used.
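
One way to write a detection rule that still works when the member is
overloaded is to test a call expression instead of taking the member's
address. The sketch below assumes C++17 and uses hypothetical names; it
is not the code in range.hh, only an illustration of the technique.

```cpp
#include <algorithm>
#include <set>
#include <type_traits>
#include <utility>
#include <vector>

// Primary template: assume the container has no usable member lower_bound().
template <typename Container, typename T, typename = void>
struct has_member_lower_bound : std::false_type {};

// Chosen when the call c.lower_bound(value) is well-formed. Unlike
// decltype(&Container::lower_bound), a call expression goes through
// overload resolution, so overloaded members are not a problem.
template <typename Container, typename T>
struct has_member_lower_bound<Container, T,
        std::void_t<decltype(std::declval<const Container&>().lower_bound(std::declval<const T&>()))>>
    : std::true_type {};

// Prefer the container's own lower_bound() (O(log N) for tree containers)
// and fall back to std::lower_bound() otherwise.
template <typename Container, typename T>
auto find_lower_bound(const Container& c, const T& value) {
    if constexpr (has_member_lower_bound<Container, T>::value) {
        return c.lower_bound(value);
    } else {
        return std::lower_bound(c.begin(), c.end(), value);
    }
}
```

std::set has const and non-const lower_bound() overloads, so it hits the
same "address of overloaded function" problem if the address-of form is
used, while the call-expression form picks the member as intended.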

Spotted when running perf_fast_forward with the wide partition limit in
cache lifted. It's so slow that I timed out waiting for the result
(> 16 min).

Fixes #2395.

Message-Id: <1495048614-9913-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3fc1703ccf)
2017-05-18 17:12:00 +03:00
Avi Kivity
b26bd8bbeb tests: fix partitioner_test for g++ 5
It can't make the leap from dht::ring_position to
stdx::optional<range_bound<dht::ring_position>> for some reason.

(cherry picked from commit ba31619594)
2017-05-18 13:10:48 +03:00
Avi Kivity
1ca7f5458b Update seastar submodule
  > tls: make shutdown/close do "clean" handshake shutdown in background
  > tls: Make sink/source (i.e. streams) first class channel owners
  > native-stack: Make sink/source (i.e. streams) first class channel owners

More close() fixes, pointed out by Tomek.
2017-05-17 19:01:44 +03:00
Calle Wilund
50c8a08e91 scylla: fix compilation errors on gcc 5
Message-Id: <1495030581-2138-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 6ca07f16c1)
2017-05-17 18:04:58 +03:00
Avi Kivity
9d1b9084ed Update seastar submodule
* seastar bfa1cb2...774c09c (1):
  > posix-stack: Make sink/source (i.e. streams) first class channel owners
2017-05-17 16:44:34 +03:00
Tomasz Grabiec
e2c75d8532 Merge "Fix performance problems with high shard counts tag" from Avi
From http://github.com/avikivity/scylla exponential-sharder/v3.

The sharder, which takes a range of tokens and splits it among shards, is
slow with large shard counts and the default
murmur3_partitioner_ignore_msb_bits.

This patchset fixes excessive iteration in sstable sharding metadata writer and
nonsingular range scans.

Without this patchset, sealing a memtable takes > 60 ms on a 48-shard
system.  With the patchset, it drops below the latency tracker threshold I
used (5 ms).

Fixes #2392.

(cherry picked from commit 84648f73ef)
2017-05-17 16:19:24 +03:00
Duarte Nunes
59063f4891 tests: Add test case for nonwrapping_range::intersection()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit f365b7f1f7)
2017-05-17 15:59:06 +03:00
Duarte Nunes
de79792373 nonwrapping_range: Add intersection() function
intersection() returns an optional range with the intersection of
this range and the other, specified range.
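
A simplified sketch of the idea, assuming closed integer intervals. The
real nonwrapping_range supports open and unbounded ends, which this
hypothetical closed_range type ignores.

```cpp
#include <algorithm>
#include <optional>

// Hypothetical simplified range: the closed interval [start, end].
struct closed_range {
    int start;
    int end;
};

// Returns the overlap of a and b, or std::nullopt if they are disjoint.
inline std::optional<closed_range> intersection(const closed_range& a, const closed_range& b) {
    const int lo = std::max(a.start, b.start);
    const int hi = std::min(a.end, b.end);
    if (lo > hi) {
        return std::nullopt;
    }
    return closed_range{lo, hi};
}
```

Returning an optional lets the caller tell an empty intersection apart
from a valid range without a sentinel value.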

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 1f9359efba)
2017-05-17 15:58:55 +03:00
Avi Kivity
3557b449ac Merge "Adding private repository to housekeeping" from Amnon
"This series adds private repository support to scylla-housekeeping"

* 'amnon/housekeeping_private_repo_v3' of github.com:cloudius-systems/seastar-dev:
  scylla-housekeeping service: Support private repositories
  scylla-housekeeping-upstart: Use repository id, when checking for version
  scylla-housekeeping: support private repositories

(cherry picked from commit eb69fe78a4)
2017-05-17 15:58:29 +03:00
Pekka Enberg
a8e89d624a cql3: Fix variable_specifications class get_partition_key_bind_indexes()
The "_specs" array contains column specifications that have the bind
marker name if there is one. That results in
get_partition_key_bind_indices() not being able to look up a column
definition for such columns. Fix the issue by keeping track of the
actual column specifications passed to add() like Cassandra does.

Fixes #2369

(cherry picked from commit a45e656efb4c6478d80e4dfc18de99b94712eeba)
2017-05-10 10:00:47 +03:00
Pekka Enberg
31cd6914a8 cql3: Move variable_specifications implementation to source file
Move the class implementation to source file to reduce the need to
recompile everything when the implementation changes...

Message-Id: <1494312003-8428-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 5b931268d4)
2017-05-10 10:00:31 +03:00
Pekka Enberg
a441f889c3 cql3: Fix partition key bind indices for prepared statements
Fix the CQL front-end to populate the partition key bind index array in
result message prepared metadata, which is needed for CQL binary
protocol v4 to function correctly.

Fixes #2355.

(cherry picked from commit ebd76617276e660c590cec0a07e97e82422111df)

Tested-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <1494257274-1189-1-git-send-email-penberg@scylladb.com>
2017-05-10 10:00:21 +03:00
Pekka Enberg
91b7cb8576 Merge "gossip mark alive fixes" from Asias
"This series fixes the use after free issue in gossip and eliminates the
duplicated / unnecessary mark alive operations.

Fixes #2341"

* tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev:
  gossip: Ignore callbacks and mark alive operation in shadow round
  gossip: Ingore the duplicated mark alive operation
  gossip: Fix user after free in mark_alive

(cherry picked from commit 1e04731fa0)
2017-05-09 01:57:23 +03:00
Avi Kivity
2b17c4aacf Merge "Fix update of counter in static rows" from Paweł
"The logic responsible for converting counter updates to counter shards was
not covered by unit tests and didn't transform counter cells inside static
rows.

This series fixes the problem and makes sure that the tests cover both
static rows and transformation logic.

Fixes #2334."

* tag 'pdziepak/static-counter-updates-1.7/v1' of github.com:cloudius-systems/seastar-dev:
  tests/counter: test transform_counter_updates_to_shards
  tests/counter: test static columns
  counters: transform static rows from updates to shards
2017-05-06 15:54:20 +03:00
Pekka Enberg
f61d9ac632 release: prepare for 1.7.0 2017-05-04 15:28:28 +03:00
Asias He
fc9db8bb03 repair: Fix partition estimation
We estimate the number of partitions for a given range of a column family
and split the range into sub ranges containing fewer partitions, each
used as a checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.
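
The corrected estimate amounts to summing the per-shard estimates. A
minimal sketch with a hypothetical function name; how the per-shard
numbers are collected is outside its scope.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: total partition estimate for a range, given the
// estimate from every shard rather than from the local shard only.
inline std::uint64_t estimate_partitions(const std::vector<std::uint64_t>& per_shard_estimates) {
    return std::accumulate(per_shard_estimates.begin(), per_shard_estimates.end(),
                           std::uint64_t{0});
}
```

Using only one entry of per_shard_estimates underestimates the range and
leads to sub ranges that hold more partitions than intended.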

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
(cherry picked from commit 66e3b73b9c)
2017-05-03 16:26:01 +03:00
Paweł Dziepak
bd67d23927 tests/counter: test transform_counter_updates_to_shards 2017-05-02 13:49:43 +01:00
Paweł Dziepak
bdeeebbd74 tests/counter: test static columns 2017-05-02 13:49:43 +01:00
Paweł Dziepak
a1cb29e7ec counters: transform static rows from updates to shards 2017-05-02 13:49:43 +01:00
Amnon Heiman
e8369644fd scylla_setup: Fix conditional when checking for newer version
During the changes to the way housekeeping checks for a newer version
and warns about it during installation, the UUID part was removed but
was kept in the surrounding if.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170426075724.7132-1-amnon@scylladb.com>
(cherry picked from commit b59c95359d)
2017-05-01 12:14:04 +03:00
Glauber Costa
a36cabdb30 reduce kernel scheduler wakeup granularity
We set the scheduler wakeup granularity to 500usec, because that is the
difference in runtime we want to see from a waking task before it
preempts the running task (which will usually be Scylla). Scheduling
other processes less often is usually good for Scylla, but in this case,
one of the "other processes" is also a Scylla thread, the one we have
been using for marking ticks after we have abandoned signals.

However, there is an artifact from the Linux scheduler that causes those
preemptions to be missed if the wakeup granularity is exactly half of
the sched_latency. Our sched_latency is set to 1ms, which
represents the maximum time period in which we will run all runnable
tasks.

We want to keep the sched_latency at 1ms, so we will reduce the wakeup
granularity to something slightly lower than 500usec, to make sure
that this artifact won't affect the scheduler calculations. 499.99usec
will do, according to my tests, but we will reduce it to a round
number.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170427135039.8350-1-glauber@scylladb.com>
(cherry picked from commit 14b9aa2285)
2017-05-01 11:13:51 +03:00
Raphael S. Carvalho
1d26fab73e sstables: add method to export ancestors
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-01 11:09:42 +03:00
Shlomi Livne
5f0c635da7 release: prepare for 1.7.rc3
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-05-01 09:53:20 +03:00
Raphael S. Carvalho
82cc3d7aa5 dtcs: do not compact fully expired sstable which ancestor is not deleted yet
Currently, fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.

Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
by the calculation of the max purgeable timestamp and prevent expired data
from being purged. The problem is that this sequence of events can keep happening
forever as reported by issue #2260.
NOTE: This problem was easier to reproduce before improvement on compaction
of expired cells, because fully expired sstable was being converted into a
sstable full of tombstones, which is also considered fully expired.

Fixes #2260.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
(cherry picked from commit 687a4bb0c2)
2017-04-30 19:36:00 +03:00
Paweł Dziepak
98d782cfe1 db: make virtual dirty soft limit configurable
Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>
(cherry picked from commit 24f4dcf9e4)
2017-04-30 19:17:55 +03:00
Avi Kivity
ea0591ad3d Merge "Fix problems with slicing using sstable's promoted index" from Tomasz
"Fixes #2327.
Fixes #2326."

* 'tgrabiec/fix-promoted-index-parsing-1.7' of github.com:cloudius-systems/seastar-dev:
  sstables: Fix incorrect parsing of cell names in promoted index
  sstables: Fix find_disk_ranges() to not miss relevant range tombstones
2017-04-30 14:48:54 +03:00
Paweł Dziepak
7eedd743bf lsa: introduce upper bound on zone size
Attempting to create huge zones may introduce significant latency. This
patch introduces the maximum allowed zone size so that the time spent
trying to allocate and initialising zone is bounded.

Fixes #2335.

Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>
(cherry picked from commit f5cf86484e)
2017-04-30 10:58:34 +03:00
Tomasz Grabiec
8a21961ec9 sstables: Fix incorrect parsing of cell names in promoted index
Range tombstones are serialized to cell names in this place:

  _sst.maybe_flush_pi_block(_out, start, {});

Note that the column set is empty. This is correct. A range tombstone
only has a clustering part. The cell name is deserialized by promoted
index reader using mp_row_consumer::column, like this:

   mp_row_consumer::column col(schema, std::move(col_name),
      api::max_timestamp); return std::move(col.clustering);

The problem is, column constructor assumes that there is always a
component corresponding to a cell name if the table is not dense, and
will pop it from the set of components (the clustering field):

  , cell(!schema.is_dense() ? pop_back(clustering) : (*(schema.regular_begin())).name())

As a result, a promoted index block which starts or ends with a range
tombstone will appear as having incorrect bounds. This may result in an
incorrect value being calculated for the data file range start.

Fixes #2327.
2017-04-27 18:30:00 +02:00
Tomasz Grabiec
08698d9030 sstables: Fix find_disk_ranges() to not miss relevant range tombstones
Suppose the promoted index looks like this:

block0: start=1 end=2
block1: start=4 end=5

start and end are cell names of the first and last cell in the block.

If there is a range tombstone covering [2,3], it will be only in
block0, because it is no longer in effect when block1 starts. However,
slicing the index for [3, +inf], which intersects with the tombstone,
will yield block1. That's because the slicing looks for a block with
an end which is greater than or equal to the start of the slice:

 if (!found_range_start) {
    if (!range_start || cmp(range_start->value(), end_ck) <= 0) {
       range_start_pos = ie.position() + offset;

We should take into account that any given block may actually contain
information for anything up to the start of the next block, so instead
of using end_ck, effectively use next block's start_ck (exclusive).
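
The difference between the two rules can be sketched on the example
above. The types and function names are hypothetical, and clustering keys
are reduced to plain integers; this is not the find_disk_ranges() code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical simplified promoted index block.
struct index_block {
    int start; // first cell name in the block
    int end;   // last cell name in the block
};

// Old rule: pick the first block whose end is >= the start of the slice.
inline std::size_t first_block_by_end(const std::vector<index_block>& blocks, int slice_start) {
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        if (slice_start <= blocks[i].end) {
            return i;
        }
    }
    return blocks.size();
}

// Revised rule: a block may hold information for anything up to, but not
// including, the start of the next block, so compare against that instead.
// The last block is treated as covering everything after its start.
inline std::size_t first_block_by_next_start(const std::vector<index_block>& blocks, int slice_start) {
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        const bool is_last = (i + 1 == blocks.size());
        if (is_last || slice_start < blocks[i + 1].start) {
            return i;
        }
    }
    return blocks.size();
}
```

For the blocks [1,2] and [4,5] and a slice starting at 3, the old rule
selects block1 and misses the tombstone stored in block0, while the
revised rule selects block0.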

Fixes #2326.
2017-04-27 18:30:00 +02:00
Tomasz Grabiec
df5a291c63 sstables: Fix usage of wrong comparator in find_disk_ranges()
This made a difference if clustering restriction bounds were not full
keys but prefixes.

Fixes #2272.

Message-Id: <1493058357-24156-1-git-send-email-tgrabiec@scylladb.com>
2017-04-24 21:56:07 +03:00
Avi Kivity
1a77312aec Merge "Reduce memory reclamation latency" from Tomasz
"Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

Statistics for reclamation pauses for a read workload over
larger-than-memory data set:

Before:

 avg = 865.796362
 stdev = 10253.498038
 min = 93.891000
 max = 264078.000000
 sum = 574022.988000
 samples = 663

After:

 avg = 513.685650
 stdev = 275.270157
 min = 212.286000
 max = 1089.670000
 sum = 340573.586000
 samples = 663

Refs #1634."

* tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev:
  lsa: Reduce reclamation latency
  tests: Add test for log_histogram
  log_histogram: Allow non-power-of-two minimum values
  lsa: Use regular compaction threshold in on-idle compaction
  tests: row_cache_test: Induce update failure more reliably
  lsa: Add getter for region's eviction function

(cherry picked from commit fccbf2c51f)

[avi: adjustments for 1.7's heap vs. master's log_histogram]
2017-04-21 22:12:52 +03:00
Duarte Nunes
ea684c9a3e alter_type_statement: Fix signed to unsigned conversion
This could allow us to alter a non-existing field of an UDT.
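
A sketch of the general pitfall, not of the actual alter_type_statement code:
a "not found" sentinel of -1 stops being negative once it is stored in an
unsigned variable. `index_of()`, `field_exists()` and `as_unsigned()` are
hypothetical helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: returns the index of `name`, or -1 if it is absent.
int32_t index_of(const std::vector<std::string>& fields, const std::string& name) {
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (fields[i] == name) {
            return static_cast<int32_t>(i);
        }
    }
    return -1;
}

// Keeps the result signed so that the -1 sentinel survives the check.
bool field_exists(const std::vector<std::string>& fields, const std::string& name) {
    const int32_t idx = index_of(fields, name);
    return idx >= 0;
}

// What the sentinel turns into if it is stored in an unsigned variable instead:
// a large positive value that a "less than zero" check can never catch.
uint32_t as_unsigned(int32_t idx) {
    return static_cast<uint32_t>(idx);
}
```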

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170419114254.5582-1-duarte@scylladb.com>
(cherry picked from commit e06bafdc6c)
2017-04-19 14:48:27 +03:00
Raphael S. Carvalho
2df7c80c66 compaction_manager: fix crash when dropping a resharding column family
Problem is that column family field of task wasn't being set for resharding,
so column family wasn't being properly removed from compaction manager.
In addition to fixing this issue, we'll also interrupt ongoing compactions
when dropping a column family, exactly like we do with shutdown.

Fixes #2291.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170418125807.7712-1-raphaelsc@scylladb.com>
(cherry picked from commit e78db43b79)
2017-04-18 17:40:09 +03:00
Raphael S. Carvalho
193b5d1782 partitioned_sstable_set: fix quadratic space complexity
Streaming generates lots of small sstables with large token ranges,
which triggers O(N^2) space usage in the interval map.
Level 0 sstables will now be stored in a structure that has O(N)
space complexity and which will be included for every read.

Fixes #2287.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
(cherry picked from commit 11b74050a1)
2017-04-18 13:05:00 +03:00
Asias He
6609c9accb gossip: Fix possible use-after-free of entry in endpoint_state_map
We take a reference to an endpoint_state entry in endpoint_state_map and
access it again after code which defers. The reference can be invalid
after the defer if someone deletes the entry during the defer.

Fix this by taking the reference again after the deferring code.
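
A minimal sketch of the pattern, assuming a plain std::map in place of
endpoint_state_map and a callback in place of the deferring code. The names
`state_map` and `bump_generation()` are hypothetical.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for the gossiper's endpoint_state_map.
using state_map = std::map<std::string, int>;

// Bumps the generation stored for `ep`, running `deferring_step` in between.
// `deferring_step` models code that may yield and let another fiber erase the
// entry. Returns false if the entry is missing before or after that step.
bool bump_generation(state_map& m, const std::string& ep,
                     const std::function<void()>& deferring_step) {
    if (m.find(ep) == m.end()) {
        return false;
    }
    deferring_step();
    // Do not reuse a reference or iterator obtained before the deferring step:
    // look the entry up again and handle the case where it was removed.
    auto it = m.find(ep);
    if (it == m.end()) {
        return false;
    }
    ++it->second;
    return true;
}
```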

I also audited the code to remove unsafe references to endpoint_state_map
entries as much as possible.

Fixes the following SIGSEGV:

Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout
0 --default-log-level info --'.
Program terminated with signal SIGSEGV, Segmentation fault.
(this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127
127     in /usr/include/c++/5/bits/stl_pair.h
[Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))]

Fixes #2271

Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>
(cherry picked from commit d27b47595b)
2017-04-13 13:18:41 +03:00
Pekka Enberg
2f107d3f61 Update seastar submodule
* seastar 211ab4a...bfa1cb2 (1):
  > resource: reduce default_reserve_memory size to fit low memory environment

Fixes #2186
2017-04-12 08:41:40 +03:00
Takuya ASADA
dd9afa4c93 dist/debian/debian/scylla-server.upstart: export SCYLLA_CONF, SCYLLA_HOME
We are sourcing sysconfig file on upstart, but forgot to load them as
environment variables.
So export them.

Fixes #2236

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491209505-32293-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit b087616a6c)
2017-04-04 11:00:33 +03:00
Pekka Enberg
4021e2befb Update seastar submodule
* seastar f391f9e...211ab4a (1):
  > http: catch and count errors in read and respond

Fixes #2242
2017-04-03 12:02:43 +03:00
Calle Wilund
9b26a57288 commitlog/replayer: Bugfix: minimum rp broken, and cl reader offset too
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as sstable at all, we must tag it as zero-rp, and
make the whole shard for it start at the same zero).

This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.

Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.

I.e. three fixes:
* Reinstate min position guarding for unencountered CF:s
* Fix stream creation in CL reader
* Fix segment map iterator use.

v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>

(cherry picked from commit b12b65db92)
2017-03-28 10:35:04 +02:00
Pekka Enberg
31b5ef13c2 release: prepare for 1.7.rc2 2017-03-23 13:22:59 +02:00
Takuya ASADA
4bbee01288 dist/common/scripts/scylla_raid_setup: don't discard blocks at mkfs time
Discarding blocks on a large RAID volume takes too much time, and the user may
suspect that the script doesn't work correctly, so it's better to skip it and
discard directly on each volume instead.

Fixes #1896

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1489533460-30127-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit b65d58e90e)
2017-03-23 09:42:51 +02:00
Calle Wilund
3cc03f88fd commitlog_replayer: Do proper const-lookup of min positions for shards
Fixes #2173

Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.

Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.
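
A sketch of such a wrapped lookup, assuming a std::map keyed by shard id and a
default position of zero. `position` and `min_position()` are hypothetical
names, and the choice of zero as the default is an assumption.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for a replay position; 0 means "replay from the start".
using position = uint64_t;

// Const lookup that maps "no entry for this shard" to a default position
// instead of inserting into the map or dereferencing end().
position min_position(const std::map<unsigned, position>& per_shard, unsigned shard) {
    auto it = per_shard.find(shard);
    return it == per_shard.end() ? position(0) : it->second;
}
```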

Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
(cherry picked from commit c3a510a08d)
2017-03-22 17:57:30 +02:00
Vlad Zolotarov
4179d8f7c4 Don't report a Tracing session ID unless the current query had a Tracing bit in its flags
Although the current master's behaviour is legal it's suboptimal and some Clients are sensitive to that.
Let's fix that.

Fixes #2179

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490115157-4657-1-git-send-email-vladz@scylladb.com>
2017-03-22 14:55:39 +02:00
Pekka Enberg
c20ddaf5af dist/docker: Use Scylla 1.7 RPM repository 2017-03-21 15:07:27 +02:00
Pekka Enberg
29dd48621b dist/docker: Expose Prometheus port by default
This patch exposes Scylla's Prometheus port by default. You can now use
the Scylla Monitoring project with the Docker image:

  https://github.com/scylladb/scylla-grafana-monitoring

To configure the IP addresses, use the 'docker inspect' command to
determine Scylla's IP address (assuming your running container is called
'some-scylla'):

  docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla

and then use that IP address in the prometheus/scylla_servers.yml
configuration file.

Fixes #1827

Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 85a127bc78)
2017-03-20 15:30:15 +02:00
Amos Kong
87de77a5ea scylla_setup: match '-p' option of lsblk with strict pattern
On Ubuntu 14.04, lsblk doesn't have the '-p' option, but
`scylla_setup` tries to get the block list with `lsblk -pnr` and
triggers an error.

The current simple pattern is matched against the whole help text, so it
might match the wrong options.
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p
   -m, --perms          output info about permissions
   -P, --pairs          use key="value" output format

Let's use strict pattern to only match option at the head. Example:
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D'
   -D, --discard        print discard capabilities
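
The same strict match can be sketched with std::regex. This only illustrates
the '^\s*-X' pattern and is not part of scylla_setup;
`declares_short_option()` is a hypothetical helper, and `opt` is assumed to be
a plain letter or digit.

```cpp
#include <regex>
#include <string>

// Returns true if a single line of help output starts (after optional
// whitespace) with the short option "-<opt>". Matching is case-sensitive and
// anchored at the start of the line, so "--perms" or "-P" do not count as "-p".
bool declares_short_option(const std::string& help_line, char opt) {
    // Mirrors the '^\s*-X' pattern. `opt` is assumed to be alphanumeric, so it
    // needs no escaping inside the regular expression.
    const std::regex re(std::string("^\\s*-") + opt);
    return std::regex_search(help_line, re);
}
```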

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>
(cherry picked from commit 468df7dd5f)
2017-03-20 08:11:57 +02:00
Raphael S. Carvalho
66c4dcba8e database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.

Fixes #192.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
(cherry picked from commit 7deeffc953)
2017-03-19 17:16:33 +02:00
Pekka Enberg
7cfdc08af9 cql3: Wire up functions for floating-point types
Fixes #2168
Message-Id: <1489661748-13924-1-git-send-email-penberg@scylladb.com>

(cherry picked from commit 3afd7f39b5)
2017-03-17 11:14:51 +02:00
Pekka Enberg
fdbe5caf41 Update scylla-ami submodule
* dist/ami/files/scylla-ami eedd12f...407e8f3 (1):
  > scylla_create_devices: check block device is exists

Fixes #2171
2017-03-17 11:14:17 +02:00
Tomasz Grabiec
522e62089b lsa: Fix debug-mode compilation error
By moving definitions of setters out of #ifdef

(cherry picked from commit 3609665b19)
2017-03-16 18:24:27 +01:00
Avi Kivity
699648d5a1 Merge "tests: Use allocating_section in lsa_async_eviction_test" from Tomasz
"The test allocates objects in batches (allocation is always under a reclaim
lock) of ~3MiB and assumes that it will always succeed because if we cross the
low water mark for free memory (20MiB) in seastar, reclamation will be
performed between the batches, asynchronously.

Unfortunately that's prevented by can_allocate_more_memory(), which fails
segment allocation when we're below the low water mark. LSA currently doesn't
allow allocating below the low water mark.

The solution which is employed across the code base is to use allocating_section,
so use it here as well.

Exposed by recent consistent failures on branch-1.7."

* 'tgrabiec/fix-lsa-async-eviction-test' of github.com:cloudius-systems/seastar-dev:
  tests: lsa_async_eviction_test: Allocate objects under allocating section
  lsa: Allow adjusting reserves in allocating_section

(cherry picked from commit 434a4fee28)
2017-03-16 12:44:54 +02:00
Calle Wilund
698a4e62d9 commitlog_replayer: Make replay parallel per shard
Fixes #2098

Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.

v2:
* Fixed whitespace errors

Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 078589c508)
2017-03-15 13:07:45 +02:00
Amnon Heiman
63bec22d28 database: requests_blocked_memory metric should be unique
Metrics name should be unique per type.

requests_blocked_memory was registered twice, one as a gauge and one as
derived.

This is not allowed.

Fixes #2165

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314162826.25521-1-amnon@scylladb.com>
(cherry picked from commit 0a2eba1b94)
2017-03-15 12:43:01 +02:00
Amnon Heiman
3d14e6e802 storage_proxy: metrics should have unique name
Metrics should have unique names. This patch renames the throttled_writes
metric for the queue length to current_throttled_writes.

Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.

This could be related to scylladb/seastar#250

Fixes #2163.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
(cherry picked from commit 295a981c61)
2017-03-15 12:43:01 +02:00
Glauber Costa
ea4a2dad96 raid script: improve test for mounted filesystem
The current test for whether or not the filesystem is mounted is weak
and will fail if multiple pieces of the hierarchy are mounted.

util-linux ships with a mountpoint command that does exactly that,
so we'll use that instead.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488742801-4907-1-git-send-email-glauber@scylladb.com>
(cherry picked from commit 2d620a25fb)
2017-03-13 17:04:58 +02:00
Glauber Costa
655e6197cb setup: support mount points in raid script
By default behavior is kept the same. There are deployments in which we
would like to mount data and commitlog to different places - as much as
we have avoided this up until this moment.

One example is EC2, where users may want to have the commitlog mounted
in the SSD drives for faster writes but keep the data in larger, less
expensive and durable EBS volumes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488258215-2592-1-git-send-email-glauber@scylladb.com>
(cherry picked from commit 9e61a73654)
2017-03-13 16:51:15 +02:00
Asias He
1a1370d33e repair: Fix midpoint is not contained in the split range assertion in split_and_add
We have:

  auto halves = range.split(midpoint, dht::token_comparator());

We saw a case where midpoint == range.start. As a result, range.split
will assert because range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range, hence the
assertion failure.
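
A minimal sketch of the failing case, assuming a simplified integer token range
with a non-inclusive start. `token_range`, `midpoint()` and `can_split()` are
hypothetical and do not reproduce the dht range API.

```cpp
// Hypothetical, simplified token range with a non-inclusive start: (start, end].
struct token_range {
    long start;
    long end;

    bool contains(long t) const {
        return start < t && t <= end;
    }
};

// Midpoint of the range, rounded down. For two adjacent tokens this is equal
// to `start`. Bounds are assumed small enough that end - start does not overflow.
long midpoint(const token_range& r) {
    return r.start + (r.end - r.start) / 2;
}

// A range is only safe to split at its midpoint if the midpoint lies inside it.
bool can_split(const token_range& r) {
    return r.contains(midpoint(r));
}
```

For the range (10, 11] the midpoint is 10, which the non-inclusive start
excludes, so a caller would need to check `can_split()` (or an equivalent
condition) before splitting.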

Fixes #2148

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
(cherry picked from commit 39d2e59e7e)
2017-03-09 09:16:57 +01:00
Paweł Dziepak
7f17424a4e Merge "Avoid losing changes to keyspace parameters of system_auth and tracing keyspaces" from Tomek
"If a node is bootstrapped with auto_bootstrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.

To prevent that, let's use the minimum timestamp so that the default schema
always loses to manual modifications. This is what Cassandra does.

Fixes #2129."

* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
  db: Create default auth and tracing keyspaces using lowest timestamp
  migration_manager: Append actual keyspace mutations with schema notifications

(cherry picked from commit 6db6d25f66)
2017-03-08 16:31:41 +02:00
Nadav Har'El
dd56f1bec7 sstable decompression: fix skip() to end of file
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.
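
A sketch of the end-of-file special case, assuming a simplified stream model
with fixed-size uncompressed chunks. `compressed_stream` and its members are
hypothetical and much simpler than the real compressed file input stream.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical, simplified model of a compressed input stream.
struct compressed_stream {
    uint64_t uncompressed_len;  // total uncompressed length of the file
    uint64_t chunk_len;         // uncompressed bytes per chunk, assumed non-zero
    uint64_t pos;               // current uncompressed position
    uint64_t current_chunk;     // chunk the next read will come from

    compressed_stream(uint64_t len, uint64_t chunk)
        : uncompressed_len(len), chunk_len(chunk), pos(0), current_chunk(0) {}

    // Maps a byte position to its chunk; only valid for positions of real bytes.
    uint64_t locate(uint64_t p) const {
        if (p >= uncompressed_len) {
            throw std::out_of_range("locate() called past the last byte");
        }
        return p / chunk_len;
    }

    void skip(uint64_t n) {
        if (n > uncompressed_len - pos) {
            throw std::out_of_range("skip past end of file");
        }
        pos += n;
        if (pos == uncompressed_len) {
            // One past the last byte: only record the position, because there
            // is no chunk to locate.
            return;
        }
        current_chunk = locate(pos);
    }
};
```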

Fixes #2143

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
(cherry picked from commit 506e074ba4)
2017-03-08 12:35:39 +02:00
Pekka Enberg
5df61797d6 release: prepare for 1.7.rc1 2017-03-08 12:25:34 +02:00
Paweł Dziepak
b6db9e3d51 db: make do_apply_counter_update() propagate timeout to db_apply()
db_apply() expects to be given a time point at which the request will
time out. Originally, do_apply_counter_update() passed 0, which meant
that all requests were timed out if do_apply() needed to wait. The
caller of do_apply_counter_update() is already given a correct timeout
time point so the only thing needed to fix this problem is to propagate
it properly inside do_apply_counter_update() to the call to do_apply().
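
A sketch of why a zero time point behaves as "already timed out", using
std::chrono::steady_clock in place of the database clock. All function names
here are hypothetical.

```cpp
#include <chrono>

using clock_type = std::chrono::steady_clock;
using deadline_t = clock_type::time_point;

// Hypothetical inner operation: succeeds only if the deadline has not passed.
bool apply_before(deadline_t deadline, deadline_t now) {
    return now < deadline;
}

// Buggy shape: ignores the caller's deadline and passes a zero time point,
// which lies in the past for any later `now`.
bool update_ignoring_timeout(deadline_t /*deadline*/, deadline_t now) {
    return apply_before(deadline_t(), now);
}

// Fixed shape: forwards the caller's deadline unchanged.
bool update_with_timeout(deadline_t deadline, deadline_t now) {
    return apply_before(deadline, now);
}
```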

Fixes #2119.
Message-Id: <20170307104405.5843-1-pdziepak@scylladb.com>
2017-03-07 12:44:11 +01:00
Gleb Natapov
f2595bea85 memtable: do not open code logalloc::reclaim_lock use
logalloc::reclaim_lock prevents reclaim from running, which may cause
regular allocation to fail although there is enough free memory.
To solve that there is an allocating_section, which acquires reclaim_lock
and, if allocation fails, runs the reclaimer outside of the lock and retries
the allocation. The patch makes use of allocating_section instead of
direct use of reclaim_lock in memtable code.
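
The retry behaviour described above can be sketched as follows. This is a
simplified illustration under stated assumptions, not logalloc's
implementation: `with_reclaim_retry()` is a hypothetical function, `alloc`
stands for work done with reclamation locked out, and `reclaim` stands for the
reclaimer run outside of the lock.

```cpp
#include <functional>
#include <new>

// Runs `alloc`, which may throw std::bad_alloc. On failure, runs `reclaim` to
// free memory and tries again, up to `max_attempts` attempts in total. The
// last failure is rethrown to the caller.
int with_reclaim_retry(const std::function<int()>& alloc,
                       const std::function<void()>& reclaim,
                       int max_attempts) {
    for (int attempt = 1; ; ++attempt) {
        try {
            return alloc();
        } catch (const std::bad_alloc&) {
            if (attempt >= max_attempts) {
                throw;
            }
        }
        // Reclamation happens outside of the section that must not be reclaimed.
        reclaim();
    }
}
```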

Fixes #2138.

Message-Id: <20170306160050.GC5902@scylladb.com>
(cherry picked from commit d7bdf16a16)
2017-03-07 11:16:15 +02:00
Gleb Natapov
e930ef0ee0 memtable: do not yield while holding reclaim_lock
Holding reclaim_lock while yielding may cause memory allocations to
fail.

Fixes #2139

Message-Id: <20170306153151.GA5902@scylladb.com>
(cherry picked from commit 5c4158daac)
2017-03-06 18:35:46 +02:00
Takuya ASADA
4cf0f88724 dist/redhat: enables discard on CentOS/RHEL RAID0
Since the CentOS/RHEL raid module disables discard by default, we need to
enable it again in order to use it.

Fixes #2033

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488407037-4795-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 6602221442)
2017-03-06 12:22:17 +02:00
Avi Kivity
372f07b06e Update scylla-ami submodule
* dist/ami/files/scylla-ami d5a4397...eedd12f (3):
  > Rewrite disk discovery to handle EBS and NVMEs.
  > add --developer-mode option
  > trivial cleanup: replace tab in indent
2017-03-04 13:31:08 +02:00
Tomasz Grabiec
0ccc6630a8 db: Fix overflow of gc_clock time point
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.

I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.

Fixes the following UBSAN report:

  /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'
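
One way to avoid this kind of overflow is to clamp the subtraction at the
minimum representable value. The sketch below uses plain 32-bit integers and a
hypothetical `subtract_grace()` helper; it is not necessarily how the fix is
implemented.

```cpp
#include <cstdint>
#include <limits>

// Subtracts a non-negative grace period from a 32-bit timestamp, clamping at
// the minimum representable value instead of overflowing.
int32_t subtract_grace(int32_t query_time, int32_t grace_seconds) {
    const int32_t min = std::numeric_limits<int32_t>::min();
    // min + grace_seconds cannot overflow because grace_seconds >= 0.
    if (query_time < min + grace_seconds) {
        return min;
    }
    return query_time - grace_seconds;
}
```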

Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 4b6e77e97e)
2017-03-01 18:50:19 +02:00
Takuya ASADA
b95a2338be dist/debian/dep: fix broken link of gcc-5, update it to 5.4.1-5
Since gcc-5/stretch=5.4.1-2 was removed from the apt repository, we are no
longer able to build gcc-5.

To avoid dead link, use launchpad.net archives instead of using apt-get source.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ba323e2074)
2017-03-01 17:13:42 +02:00
Tomasz Grabiec
f2d0ac9994 query: Fix invalid initialization of _memory_tracker by moving-from-self
Fixes the following UBSAN warning:

  core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment

Since the field was not initialized properly, this probably also fixes some
user-visible bug.
Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0c84f00b16)
2017-03-01 11:56:49 +00:00
Gleb Natapov
56725de0db sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying file's object causes
crashes.

[tgrabiec: fixed typo]

Fixes #2122.

Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
2017-02-28 11:04:26 +02:00
Avi Kivity
6f479c8999 Update seastar submodule
* seastar b14373b...f391f9e (1):
  > fix append_challenged_posix_file_impl::process_queue() to handle recursion

Fixes #2121.
2017-02-28 10:55:54 +02:00
Calle Wilund
8c0488bce9 messaging_service: Move log printout to actual listen start
Fixes #1845
Log printout was before we actually had evaluated endpoint
to create, thus never included SSL info.
Message-Id: <1487766738-27797-1-git-send-email-calle@scylladb.com>

(cherry picked from commit d5f57bd047)
2017-02-23 13:18:33 +02:00
Avi Kivity
68dd11e275 config: enable new sharding algorithm for new deployments
Set murmur3_partitioner_ignore_msb_bits to 12 (enabling the new sharding
algorithm), but do this in scylla.yaml rather than the built-in defaults.
This avoids changing the configuration for existing clusters, as their
scylla.yaml file will not be updated during the upgrade.
Message-Id: <20170214123253.3933-1-avi@scylladb.com>

(cherry picked from commit 9b113ffd3e)
2017-02-22 11:23:46 +01:00
Tomasz Grabiec
a64c53d05f Update seastar submodule
* seastar fc27cec...b14373b (1):
  > reactor utilization should return the utilization in 0-1 range
2017-02-22 09:38:17 +01:00
Paweł Dziepak
42e7a59cca tests/cql_test_env: wait for storage service initialization
Message-Id: <20170221121130.14064-1-pdziepak@scylladb.com>
(cherry picked from commit 274bcd415a)
2017-02-21 17:06:10 +02:00
Avi Kivity
2cd019ee47 Merge "Fixes for counter cell locking" from Paweł
"This series contains some fixes and a unit test for the logic responsible
for locking counter cells."

* 'pdziepak/cell-locking-fixes/v1' of github.com:cloudius-systems/seastar-dev:
  tests: add test for counter cell locker
  cell_locking: fix schema upgrades
  cell_locker: make locker non-movable
  cell_locking: allow to be included by anyone

(cherry picked from commit b8c4b35b57)
2017-02-15 17:37:38 +02:00
Takuya ASADA
bc8b553bec dist/redhat: stop backporting ninja-build from Fedora, install it from EPEL instead
ninja-build-1.6.0-2.fc23.src.rpm was deleted from the Fedora web site for some
reason, but there is ninja-build-1.7.2-2 on EPEL, so we don't need to
backport from Fedora anymore.

Fixes #2087

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 9c8515eeed)
2017-02-15 12:58:44 +02:00
Avi Kivity
0ba98be899 Update seastar submodule
* seastar bff963a...fc27cec (1):
  > collectd: send double correctly for gauge
2017-02-14 16:09:22 +02:00
Avi Kivity
d6899134a7 Update seastar submodule
* seastar f07f8ed...bff963a (1):
  > prometheus: send one MetricFamily per unique metric name
2017-02-13 11:50:43 +02:00
Avi Kivity
5253031110 seastar: point submodule at scylla-seastar.git
Allows backporting seastar patches independently of master.
2017-02-13 11:49:54 +02:00
Avi Kivity
a203c87f0d Merge "Disallow mixed schemas" from Paweł
"This series makes sure that schemas containing both counter and non-counter
regular or static columns are not allowed."

* 'pdziepak/disallow-mixed-schemas/v1' of github.com:cloudius-systems/seastar-dev:
  schema: verify that there are no both counter and non-counter columns
  test/mutation_source: specify whether to generate counter mutations
  tests/canonical_mutation: don't try to upgrade incompatible schemas

(cherry picked from commit 9e4ae0763d)
2017-02-07 18:04:24 +02:00
Gleb Natapov
37fc0e6840 storage_proxy: use storage_proxy clock instead of explicit lowres_clock
Merge commit 45b6070832 used a butchered version of the storage_proxy
patch to adjust to the rpc timer change instead of the one I sent. This
patch fixes the differences.

Message-Id: <20170206095237.GA7691@scylladb.com>
(cherry picked from commit 3c372525ed)
2017-02-06 12:51:52 +02:00
Avi Kivity
0429e5d8ea cell_locking: work around for missing boost::container::small_vector
small_vector doesn't exist on Ubuntu 14.04's boost, use std::vector
instead.

(cherry picked from commit 6e9e28d5a3)
2017-02-05 20:49:43 +02:00
Avi Kivity
3c147437ac dist: add build dependency on automake
Needed by seastar's c-ares.

(cherry picked from commit 2510b756fc)
2017-02-05 20:17:27 +02:00
Takuya ASADA
e4b3f02286 dist/common/systemd: introduce scylla-housekeeping restart mode
scylla-housekeeping needs to run in 'restart mode' to check the version during
scylla-server restart. This wasn't called from a systemd timer, so add it.

Existing scylla-housekeeping.timer renamed to scylla-housekeeping-daily.timer,
since it is running 'daily mode'.

Fixes #1953

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1486180031-18093-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit e82932b774)
2017-02-05 11:28:03 +02:00
Avi Kivity
5a8013e155 dist: add libtool build dependency for seastar/c-ares
(cherry picked from commit 4175f40da1)
2017-02-05 11:27:38 +02:00
Pekka Enberg
fdba5b8eac release: prepare for 1.7.rc0 2017-02-04 11:04:32 +02:00
Paweł Dziepak
558a52802a cell_locking: fix partition_entry::equal_compare
The comparator constructor took schema by value instead of const l-ref
and, consequently, later tried to access an object that had already been
destroyed.
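
A sketch of the corrected constructor shape, assuming a trivial `schema` struct.
The class below is a hypothetical stand-in and not the actual cell_locking code.

```cpp
#include <string>

// Hypothetical, minimal schema type.
struct schema {
    std::string name;
};

class equal_compare {
    const schema& _schema;
public:
    // Taking `const schema&` binds the member to the caller's object. Taking
    // `schema` by value would bind it to a parameter that is destroyed when
    // the constructor returns, leaving a dangling reference.
    explicit equal_compare(const schema& s) : _schema(s) {}

    const std::string& schema_name() const {
        return _schema.name;
    }
};
```

The caller still has to keep the schema object alive for as long as the
comparator is in use.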
Message-Id: <20170202135853.8190-1-pdziepak@scylladb.com>

(cherry picked from commit 37b0c71f1d)
2017-02-03 21:28:42 +02:00
Avi Kivity
4f416c7272 Merge "Avoid avalanche of tasks after memtable flush" from Tomasz
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.

However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may start many request releasing
threads due to group size updates.

Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()

(cherry picked from commit 7a00dd6985)
2017-02-03 09:47:50 +01:00
Paweł Dziepak
788892e931 counters: fix build failure on gcc5
Message-Id: <20170202132049.4497-1-pdziepak@scylladb.com>
2017-02-02 14:23:49 +01:00
Piotr Jastrzebski
36b2c4df19 row_cache_test: extend test_mvcc
Make the test execute with and without an active
reader to memtable that's flushed to cache.

This improves the code coverage of MVCC with tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <007b6cd1ba7a84ea5675ea82e454bf1adf3b3330.1485954941.git.piotr@scylladb.com>
2017-02-02 13:51:32 +01:00
Tomasz Grabiec
5458a32f13 gdb: Introduce commands for inspecting pending task queue
Message-Id: <1485426236-6627-1-git-send-email-tgrabiec@scylladb.com>
2017-02-02 13:15:17 +02:00
Avi Kivity
000edc36c4 Merge "Counters" from Paweł
"This series introduces support for counters. The implementation of
counters more or less follows the design described on our wiki page [1].
Counter cells contain many shards with replicas being able to modify
and announce new versions only of the shards that they own. Historically,
there were three types of shards: local, remote and global. In these
patches only support for the global ones is added.

[1] https://github.com/scylladb/scylla/wiki/Counters

Currently, counters are only enabled as an experimental feature as there
are still several things that need to be done before they become production
ready. Namely, the performance is expected to be quite poor (especially
for writes), there is no proper tracing support and timed out counter
requests may not be recognized and dropped early. There are also no
counter-related metrics.

However, apart from these problems there are no other missing parts of
counter implementation and they are expected to work correctly.

Fixes #577."

* 'pdziepak/counters/v3-rebased' of github.com:cloudius-systems/seastar-dev: (38 commits)
  perf_simple_query: add counter tables tests
  thrift: add support for counter operations
  cql3: allow counters in CREATE TABLE statements
  cql3: selection: do not panic when seeing counters
  storage_proxy: support counter updates
  storage_proxy: add get_live_endpoints()
  cql3: add counter increment and decrement operations
  db: add operations for applying counter updates
  counters: implement transforming counter deltas to shards
  add infrastructure for locking counter cells
  add fnv1a hasher
  position_in_partition: add feed_hash()
  position_in_partition: add functions for querying object type
  types: make counter_type_impl report its cql3_type
  transport: encode counters as long_type
  mutation_partition: make for_each_cell() accessible outside source file
  messaging_service: add COUNTER_MUTATION verb
  storage_service: add COUNTERS feature
  idl: add idl description of consistency level
  schema: make is_counter() return correct value
  ...
2017-02-02 12:40:09 +02:00
Paweł Dziepak
8671d8329d perf_simple_query: add counter tables tests 2017-02-02 10:35:14 +00:00
Paweł Dziepak
4ca7f0a491 thrift: add support for counter operations 2017-02-02 10:35:14 +00:00
Paweł Dziepak
fa29ef3cc0 cql3: allow counters in CREATE TABLE statements 2017-02-02 10:35:14 +00:00
Paweł Dziepak
fce6e0987f cql3: selection: do not panic when seeing counters
At this stage counters cells are already long_type values, so no special
handling is necessary.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
1e8814f5ce storage_proxy: support counter updates 2017-02-02 10:35:14 +00:00
Paweł Dziepak
c14c6b753b storage_proxy: add get_live_endpoints() 2017-02-02 10:35:14 +00:00
Paweł Dziepak
d6ebf84edf cql3: add counter increment and decrement operations 2017-02-02 10:35:14 +00:00
Paweł Dziepak
5a0955e89d db: add operations for applying counter updates 2017-02-02 10:35:14 +00:00
Paweł Dziepak
8d889082bf counters: implement transforming counter deltas to shards
The leader receives counter updates as deltas which have to be
transformed to counter shards. In order to do that, current local shard
of the modified counter cell needs to be read, logical clock incremented
and the value modified by the specified delta.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
55277b3182 add infrastructure for locking counter cells
The leader receives counter updates in the form of deltas which need to be
transformed to counter shards. In order to do that the node needs to
read its current state of the modified counter cells. Since this is
essentially a read-modify-write operation, an appropriate locking
mechanism is needed.

Counter cell locker introduced in this patch uses a hashtable of
partition entry each containing a hashtable of cell entries. Inside a
cell entry there is a semaphore used for synchronization. Once no longer
needed cell entries and partition entries are removed.

In order to avoid deadlocks cell entries are always locked in the same
order which is the lexicographical order of (clustering key, column id)
pairs. Note that schema changes are not a difficulty since they do not
make it possible to change ordering of such pairs.
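
The ordering rule can be illustrated with a small sketch. It assumes a
std::string as a stand-in for the clustering key (the real key is compared
according to the schema), and cell_id and locking_order are hypothetical
names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cell identifier: (clustering key, column id).
using cell_id = std::pair<std::string, uint32_t>;

// Return the cells in the order in which their locks should be taken.
// std::pair compares lexicographically, so every updater that sorts its
// cells this way acquires overlapping locks in the same order, which
// rules out lock-order deadlocks between updaters.
std::vector<cell_id> locking_order(std::vector<cell_id> cells) {
    std::sort(cells.begin(), cells.end());
    // Drop duplicates so that no cell is locked twice by one updater.
    cells.erase(std::unique(cells.begin(), cells.end()), cells.end());
    return cells;
}
```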
2017-02-02 10:35:14 +00:00
Paweł Dziepak
22fbb11f90 add fnv1a hasher 2017-02-02 10:35:14 +00:00
Paweł Dziepak
a16761dcb4 position_in_partition: add feed_hash() 2017-02-02 10:35:14 +00:00
Paweł Dziepak
f4fce93807 position_in_partition: add functions for querying object type 2017-02-02 10:35:14 +00:00
Paweł Dziepak
53d9a6f220 types: make counter_type_impl report its cql3_type 2017-02-02 10:35:14 +00:00
Paweł Dziepak
a805bea97a transport: encode counters as long_type
For the purposes of CQL counters are long values (either a delta in case
of writes or the final value for reads).
2017-02-02 10:35:14 +00:00
Paweł Dziepak
b6564651e4 mutation_partition: make for_each_cell() accessible outside source file
for_each_cell() const already can be used from any place in the code,
allow the same with non-const version.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
bf60b7844b messaging_service: add COUNTER_MUTATION verb
This verb is going to be used for coordinator<->leader communication
during counter updates.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
67ca6959bd storage_service: add COUNTERS feature 2017-02-02 10:35:14 +00:00
Paweł Dziepak
9989239c97 idl: add idl description of consistency level 2017-02-02 10:35:14 +00:00
Paweł Dziepak
4b3c0db5cc schema: make is_counter() return correct value 2017-02-02 10:35:14 +00:00
Paweł Dziepak
99b21fbb86 tests: random_mutation_generator: generate counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
de2acd47c9 tests/sstables: test reading and writing counters 2017-02-02 10:35:14 +00:00
Paweł Dziepak
83c6fc1114 sstables: write counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
5905729c4a sstables: read counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
de698105e4 tests/counter: test apply, difference and freeze 2017-02-02 10:35:14 +00:00
Paweł Dziepak
0c93d01232 atomic_cell: make sure upper level tombstones cover counters
Support for deletion of counters is limited in a way that once deleted
they cannot be used again (i.e. tombstone always wins, regardless of the
timestamp). Logic responsible for merging two counter cells already
makes sure that tombstones are handled properly, but it is also
necessary to ensure that higher level tombstones always cover counters.
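
The difference from regular cells can be sketched as below. The types and
function names are hypothetical, and the rule shown for regular cells
(a tombstone covers a cell whose timestamp is not newer) is an assumption
added for contrast.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical higher-level (row or partition) tombstone.
struct tombstone_sketch {
    bool present;
    int64_t timestamp;
};

// Regular cell (assumed rule): covered only by a tombstone that is at
// least as new as the cell.
bool covers_regular_cell(const tombstone_sketch& t, int64_t cell_timestamp) {
    return t.present && t.timestamp >= cell_timestamp;
}

// Counter cell: any tombstone wins, so the cell timestamp is ignored.
bool covers_counter_cell(const tombstone_sketch& t, int64_t /*cell_timestamp*/) {
    return t.present;
}
```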
2017-02-02 10:35:14 +00:00
Paweł Dziepak
9f1ebd4f7c idl/mutation: add counter serialisation logic 2017-02-02 10:35:14 +00:00
Paweł Dziepak
47d14906e6 mutation_partition: support querying counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
63f25eb12c mutation_hasher: handle counter cells properly 2017-02-02 10:35:14 +00:00
Paweł Dziepak
25c8ed1c71 feed_hash: allow additional arguments 2017-02-02 10:35:14 +00:00
Paweł Dziepak
a57e86cc37 mutation_partition: compute counter difference 2017-02-02 10:35:13 +00:00
Paweł Dziepak
2725a4945d mutation_partition: apply counter cells properly 2017-02-02 10:35:13 +00:00
Paweł Dziepak
496b42fcc7 tests: add test for counters 2017-02-02 10:35:13 +00:00
Paweł Dziepak
7bb5b49799 add in memory representation of counters
Live counter cells are collections of shards, each one representing the
sum of all operations performed by a particular replica. This commit
introduces an in-memory representation of counters as well as basic
operations such as merge, difference and hashing.
2017-02-02 10:35:13 +00:00
Paweł Dziepak
c66db213d3 storage_service: allow getting local host id without futures<> 2017-02-02 10:35:13 +00:00
Paweł Dziepak
0a8f00c159 atomic_cell: add flag for recognizing counter updates
A counter cell may be either a collection of shards or just a delta. The
former can only appear in certain places on coordinator and leader.
2017-02-02 10:35:13 +00:00
Paweł Dziepak
ab344c5aa3 mutation_partition_view: extract atomic_cell variant 2017-02-02 10:35:13 +00:00
Paweł Dziepak
83f6018ea2 schema: keep counter information in column definition 2017-02-02 10:35:13 +00:00
Avi Kivity
aec419da13 Merge seastar upstream
* seastar c1dbd89...f07f8ed (3):
  > Merge "Introduce when_all_succeed()" from Paweł
  > tests: adjust collectd test for metric API change
  > Merge "DNS query support" from Calle
2017-02-02 12:30:10 +02:00
Piotr Jastrzebski
15cc8460bd mutation_partition: make rows_entry constructors explicit
All converting constructors should be explicit, otherwise they
can cause confusion. I ran into exactly that when a
clustering key was implicitly converted into a rows_entry where
I was not expecting it.
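
A minimal sketch of the effect, using hypothetical stand-ins (key_sketch,
entry_sketch) rather than the real clustering_key and rows_entry:

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a clustering key.
struct key_sketch {
    std::string value;
};

// Hypothetical stand-in for rows_entry.
struct entry_sketch {
    key_sketch key;
    // `explicit` stops a key_sketch from silently becoming an entry_sketch.
    explicit entry_sketch(key_sketch k) : key(std::move(k)) {}
};

// Implicit conversion is rejected; explicit construction still works.
static_assert(!std::is_convertible<key_sketch, entry_sketch>::value,
              "no implicit key -> entry conversion");
static_assert(std::is_constructible<entry_sketch, key_sketch>::value,
              "explicit construction is allowed");
```

With the constructor marked explicit, passing a key_sketch where an
entry_sketch is expected fails to compile instead of creating a temporary.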

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c3f19719760f6dc7cf5e858b9c452506faedf521.1485950529.git.piotr@scylladb.com>
2017-02-01 17:57:50 +01:00
Amnon Heiman
45b6070832 Merge seastar upstream
* seastar 397685c...c1dbd89 (13):
  > lowres_clock: drop cache-line alignment for _timer
  > net/packet: add missing include
  > Merge "Adding histogram and description support" from Amnon
  > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
  > Set the option '--server' of tests/tcp_sctp_client to be required
  > core/memory: Remove superfluous assignment
  > core/memory: Remove dead code
  > core/reactor: Use logger instead of cerr
  > fix inverted logic in overprovision parameter
  > rpc: fix timeout checking condition
  > rpc: use lowres_clock instead of high resolution one
  > semaphore: make semaphore's clock configurable
  > rpc: detect timedout outgoing packets earlier

Includes treewide change to accommodate rpc changing its timeout clock
to lowres_clock.

Includes fixup from Amnon:

collectd api should use the metrics getters

In preparation for the change in the metrics layer, this changes
the way the collectd API reads metric values: it now uses the getters
instead of accessing the member directly.

This will be important when the internal implementation changes
from a union to a variant.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
2017-02-01 14:39:08 +02:00
Glauber Costa
facb0aa6d9 row_cache: rewrite loop so that debug mode doesn't become a noop
need_preempt() is always true in debug mode. Because of that, this loop
will never be executed. Rewrite it as a do-while loop so we are sure
that it is executed at least once - or exactly once in debug mode.
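
The difference between the two loop shapes can be sketched as follows.
need_preempt is passed in as a callable so that the example is
self-contained; the function names and the batch counter are hypothetical.

```cpp
#include <cassert>
#include <functional>

// Pre-checked loop: if need_preempt() is always true (as in debug mode),
// the body never runs and no batch is processed.
int update_while(std::function<bool()> need_preempt, int batches) {
    int done = 0;
    while (done < batches && !need_preempt()) {
        ++done;
    }
    return done;
}

// Post-checked loop: the body runs at least once, and exactly once when
// need_preempt() is always true.
int update_do_while(std::function<bool()> need_preempt, int batches) {
    assert(batches > 0);
    int done = 0;
    do {
        ++done;
    } while (done < batches && !need_preempt());
    return done;
}
```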

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>
2017-02-01 10:02:13 +02:00
Tomasz Grabiec
634761dbba commitlog: Fix default limit for size on disk
The per-node limit will be total memory divided by number of shards
instead of just total memory. For example, when Scylla is started with
-c16 -m16G, the commit log will induce flushes on given shard when
unflushed data on that shard exceeds 62MB instead of 1GB.

Fixes #2046.

Message-Id: <1485874534-10939-1-git-send-email-tgrabiec@scylladb.com>
2017-01-31 17:12:59 +02:00
Piotr Jastrzebski
c7e95af0b0 row_cache_test: fix test_mvcc
Currently the test does not wait for cache update
to finish before carrying on with the checks.

This makes the test nondeterministic and simply wrong
because checks expect update to be finished.

This patch changes the test to wait for update to finish.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2a99bba24b1628466d3495332b48ef3ccdb43c26.1485862389.git.piotr@scylladb.com>
2017-01-31 11:37:29 +00:00
Avi Kivity
aedb5e5cfa mutation_fragment: add std::ostream support
Helps poor debuggers.
Message-Id: <20170130163605.4858-1-avi@scylladb.com>
2017-01-31 10:37:42 +01:00
Tomasz Grabiec
0d40b86546 Merge "bail sooner from cache update if need_preempt()" from Glauber
An earlier patch of mine was using should_yield to do the same.  That
is a better direction, but should_yield() was demonstrably more
expensive so for now we'll go with need_preempt() - since this is
hurting pretty much every latency-dependent workload.

I am also including the scripts that I have used to measure and
compare the various versions of this patch.
2017-01-31 09:51:34 +01:00
Pekka Enberg
a625aae489 cql3/values.hh: Fix to_bytes_opt(raw_value)
The data() method already returns a bytes_opt so there's no need to call to_bytes_opt() again.

Fixes compilation failure on CentOS:

  In file included from ./cql3/query_options.hh:51:0,
                   from ./cql3/cql_statement.hh:47,
                   from ./cql3/statements/raw/select_statement.hh:45,
                   from build/release/gen/cql3/CqlParser.hpp:65,
                   from build/release/gen/cql3/CqlParser.cpp:44:
  ./cql3/values.hh: In function 'bytes_opt to_bytes_opt(const cql3::raw_value&)':
  ./cql3/values.hh:184:37: error: no matching function for call to 'to_bytes_opt(bytes_opt)'
       return to_bytes_opt(value.data());

Message-Id: <1485761863-28236-1-git-send-email-penberg@scylladb.com>
2017-01-30 10:49:31 +02:00
Gleb Natapov
6e4817137e storage_proxy: report foreground reads instead of reads
The reason is the same as why foreground writes are reported instead of
total writes (049ae37d08): It is much easier to see what is going on
this way.

Also fixes a typo in a counter's description.

Fixes #1217

Message-Id: <20170129093412.GS11469@scylladb.com>
2017-01-29 12:40:56 +02:00
Avi Kivity
9fb2f31616 Merge "CQL binary protocol unset value support" from Pekka
This patch series adds support for "unset values" that were introduced
in CQL binary protocol v4. They allow bound statements to skip updates
to some or all of the bound variables.

Unset values are specified using the BoundStatement.unset() method in
the Java driver:

  http://docs.datastax.com/en/drivers/java/3.1/com/datastax/driver/core/BoundStatement.html#unset-int-

and using the UNSET_VALUE constant in the Python driver:

  https://datastax.github.io/python-driver/api/cassandra/query.html#cassandra.query.UNSET_VALUE

Fixes #2039.

* 'penberg/cql-unset-values/v2' of github.com:cloudius-systems/seastar-dev:
  transport/server: CQL unset value support
  cql3/statements/select_statement: Unset value support
  cql3/user_types: Unset value support
  cql3/tuples: Unset value support
  cql3/maps: Unset value support
  cql3/sets: Unset value support
  cql3/lists: Unset value support
  cql3/constants: UNSET_VALUE constant
  cql3/constants: Unset value support
  cql3/attributes: Unset value support
  types.hh: Add field_name_as_string() to user_type_impl type
  cql3: Introduce raw_value and raw_value_view types
2017-01-29 10:59:01 +02:00
Pekka Enberg
533c8d3949 transport/server: CQL unset value support
This patch implements support for CQL unset values at the protocol level.

Fixes #2039
2017-01-27 09:24:36 +02:00
Pekka Enberg
2bd560118e cql3/statements/select_statement: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
baaf1779c5 cql3/user_types: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
99c7dabd2a cql3/tuples: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
a0e6f6f371 cql3/maps: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
f883e64d70 cql3/sets: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
50ec81ee67 cql3/lists: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
c4cd0a6541 cql3/constants: UNSET_VALUE constant 2017-01-27 09:24:36 +02:00
Pekka Enberg
063be3ed44 cql3/constants: Unset value support 2017-01-27 09:24:36 +02:00
Glauber Costa
b4ac2c1d60 debug: add systemtap script to measure interesting latencies during cache updates.
Example output:

Measuring Scylla row cache update times ^C
Total update time, (usec)
value |-------------------------------------------------- count
    2 |                                                   0
    4 |                                                   0
    8 |@@                                                 2
   16 |@@@                                                3
   32 |                                                   0
   64 |                                                   0
  128 |@@@@                                               4
  256 |@@                                                 2
  512 |                                                   0
 1024 |                                                   0

Time spent per partition batch (nsec)
 value |-------------------------------------------------- count
   128 |                                                       0
   256 |                                                       0
   512 |                                                      43
  1024 |                                                       2
  2048 |                                                       2
  4096 |                                                      45
  8192 |                                                     349
 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  61494
 32768 |@@@@@@@@@@@@@@@@@                                  21497
 65536 |                                                       0
131072 |                                                       0

Partitions updated per batch:
value |-------------------------------------------------- count
    0 |                                                      57
    1 |                                                      46
    2 |                                                      76
    4 |                                                     134
    8 |                                                     324
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  82795
   32 |                                                       0
   64 |                                                       0

Total partitions updated: 2485000
Average time spent per partition batch (nsec): 28816
Average time per partition per partition (nsec): 967

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 22:15:16 -05:00
Glauber Costa
69dbb3e108 row_cache: yield if need_preempt(), even if there is quota left.
The quota check is quite old at the moment, and dates back to a time in
which the infrastructure in seastar threads was lacking a lot. It is a
bad check since it will not take into consideration the size of the
partition or the time it takes to merge them.

A better check would at least take need_preempt() into account, so that
we would respect the task quota. That check is now embedded into
should_yield(), so there would be no need to check anything else.

Although should_yield() does the job, it is still currently quite
expensive. And because we are in a seastar thread with a computationally
intensive loop, it can hurt latency a lot.

So as a temporary measure, let's at least check for need_preempt() - as
it is hurting real users at the moment - and soon work on making
should_yield() cheaper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 22:10:54 -05:00
Glauber Costa
0e1f64b163 row_cache: add systemtap markers for the update process
update is one of our biggest sources of performance issues as far as the
cache is concerned. systemtap can be useful in helping track some of
them down.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 21:56:32 -05:00
Duarte Nunes
937ed1bacb bound_view: Simplify copy ctor
By using default generation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1485355007-1913-1-git-send-email-duarte@scylladb.com>
2017-01-26 19:29:29 +02:00
Avi Kivity
b91b9b351a Revert "Merge seastar upstream"
This reverts commit f301c678bfe5eb5df71f71fd20e08b422b1023bb; the rpc changes
don't compile due to rpc timeout type change.
2017-01-26 18:30:56 +02:00
Avi Kivity
f301c678bf Merge seastar upstream
* seastar 397685c...f5fa2e3 (3):
  > rpc: use lowres_clock instead of high resolution one
  > semaphore: make semaphore's clock configurable
  > rpc: detect timedout outgoing packets earlier
2017-01-26 18:16:14 +02:00
Pekka Enberg
3385144860 cql3/attributes: Unset value support 2017-01-26 13:50:04 +02:00
Pekka Enberg
630aba32ff types.hh: Add field_name_as_string() to user_type_impl type
This is needed to construct validation error messages when user types
encounter unset values.
2017-01-26 13:50:04 +02:00
Pekka Enberg
be0351b49c cql3: Introduce raw_value and raw_value_view types
Currently, the code is using bytes_opt and bytes_view_opt to represent
CQL values, which can hold a value or null. In preparation for
supporting a third state, unset value introduced in CQL v4, introduce
new raw_value and raw_value_view types and use them instead.

The new types are based on boost::variant<> and are capable of holding
null, unset values, and blobs that represent a value.
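
A three-state value of this kind might look like the sketch below. It is
an illustration only: the patch uses boost::variant<>, while std::variant
(C++17) is used here to avoid the dependency, and every name in the block
is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

struct null_value {};
struct unset_value {};
using blob = std::string;  // stand-in for the real bytes type

// Hypothetical three-state value: null, unset, or a blob.
class raw_value_sketch {
    using storage = std::variant<null_value, unset_value, blob>;
    storage _data;
    explicit raw_value_sketch(storage s) : _data(std::move(s)) {}
public:
    static raw_value_sketch make_null() { return raw_value_sketch(storage(null_value{})); }
    static raw_value_sketch make_unset() { return raw_value_sketch(storage(unset_value{})); }
    static raw_value_sketch make_value(blob b) { return raw_value_sketch(storage(std::move(b))); }

    bool is_null() const { return std::holds_alternative<null_value>(_data); }
    bool is_unset() const { return std::holds_alternative<unset_value>(_data); }
    bool is_value() const { return std::holds_alternative<blob>(_data); }

    // Only valid when is_value() is true; std::get throws otherwise.
    const blob& value() const { return std::get<blob>(_data); }
};
```

Compared with an optional, the extra alternative lets a bound statement
distinguish "write null" from "leave this column untouched".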
2017-01-26 13:50:04 +02:00
Gleb Natapov
64660397fc storage_proxy: move operation type information from counter's name to a label
Makes it much more flexible to view the data in various ways in Grafana.

Message-Id: <20170126102746.GL11469@scylladb.com>
2017-01-26 12:38:29 +02:00
Tomasz Grabiec
2c7902fb2b Revert "lsa: Reduce reclamation latency"
This reverts commit d61002cc33.

Introduced a regression in row_cache_alloc_stress.

The problem is that reclaim_from_evictable() evicts way too much after
the refactor due to the stop condition not taking into account how
much data was evicted so far and only looking at occupancy of the
minimal segment. This may lead to eviction of the whole region.
2017-01-26 10:43:18 +01:00
Paweł Dziepak
8cdffd7c57 time_type_impl: value initialize result
parse_time() adds hours, minutes, etc. to a final value 'result'.
However, it is of type std::chrono::nanoseconds which means it is not
zeroed at initialization unless it is explicitly asked to do so.
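
A minimal sketch of the fix, assuming a simplified parser that already has
the components as integers (to_nanoseconds is a hypothetical name):

```cpp
#include <cassert>
#include <chrono>

// Sum time components into a nanosecond count. The braces value-initialize
// `result` to zero; a plain `std::chrono::nanoseconds result;` would leave
// its starting value indeterminate.
std::chrono::nanoseconds to_nanoseconds(int hours, int minutes, int seconds) {
    std::chrono::nanoseconds result{};
    result += std::chrono::hours(hours);
    result += std::chrono::minutes(minutes);
    result += std::chrono::seconds(seconds);
    return result;
}
```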

Fixed debug mode failures in types_test and cql_query_test.

Message-Id: <20170125155239.1253-1-pdziepak@scylladb.com>
2017-01-25 17:56:31 +02:00
Paweł Dziepak
034d028329 Merge "range_tombstone_list: Properly implement difference()" from Duarte
"This patchset properly implements range_tombstone_list::difference(),
which was very broken. We add unit tests for the function and ensure
we always randomly generate range_tombstones in other unit tests so
other problems aren't hidden."
2017-01-25 12:08:19 +00:00
Duarte Nunes
8c65b98ea7 mutation_merger: Emit deferred tombstones
This patch ensures the mutation_merger emits any deferred tombstones
that it still may be holding before closing the stream.

Together with the range_tombstone_list: Properly implement
difference() patch set, this fixes breakage of streamed_mutation_test
and row_cache_test.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170123195643.9876-1-duarte@scylladb.com>
2017-01-25 12:02:03 +00:00
Takuya ASADA
bce0fb3fa2 dist: add lspci to dependencies, since it is used by dpdk-devbind.py
In a minimal environment scylla_sysconfig_setup fails because the lspci command is not installed, so install it at package installation time.

Fixes #2035

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485327435-20543-1-git-send-email-syuu@scylladb.com>
2017-01-25 10:22:57 +02:00
Avi Kivity
d2fc98270e Merge seastar upstream
* seastar 6d80c6a...397685c (4):
  > Merge "add label to the io_queue" from Amnon
  > rpc: Modify the shutdown code to wait and handle exceptions
  > tls.cc: Fix shutdown_input/output to conform with expected socket behaviour
  > core: Add counter for polls
2017-01-24 18:36:25 +02:00
Gleb Natapov
ccee01f352 storage_proxy: put datacenter name into a label instead of counter's name
Having datacenter name as a label makes it possible to create Prometheus board for the counters.

Message-Id: <20170124132051.GX11469@scylladb.com>
2017-01-24 15:27:34 +02:00
Duarte Nunes
54a464ae27 random_mutation_generator: Always generate range tombstones
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 19:02:23 +01:00
Duarte Nunes
a01aa91c82 range_tombstone_list: Add unit tests for difference()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Duarte Nunes
85315d1760 range_tombstone_list: Correctly implement difference()
The difference method wasn't properly implemented. The version in this
patch correctly computes the difference and returns a range tombstone
list containing those range tombstones in "this" but absent from the
other, specified range tombstone list.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Duarte Nunes
e7d20ea900 range_tombstone_list: Add apply() convenience overload
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Duarte Nunes
0847954d92 bound_view: Add copy ctor and assignment operator
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Avi Kivity
1758361640 Merge seastar upstream
* seastar 38aaa4a...6d80c6a (2):
  > DPDK: Change the metrics registration with label support
  > metric: Fix the error: could not convert {...} from <brace-enclosed initializer list> to struct metric_definition_impl
2017-01-23 11:55:21 +02:00
Takuya ASADA
f6d7a76223 dist: rename dist/ubuntu to dist/debian
dist/ubuntu now supports both Ubuntu and Debian, and Ubuntu is a
Debian variant, so dist/debian is a better name.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485161896-21851-1-git-send-email-syuu@scylladb.com>
2017-01-23 10:59:52 +02:00
Avi Kivity
31c8e6885b build: improve support for custom builds
Add a counter field to RELEASE, just before the date, and fix it at zero.
This allows custom package builds to override it in a way that sorts before
the official packages.

Example:

  Official release:   1.6.0-0.20160120.<githash>
  Custom release 1:   1.6.0-1.avi.20160121.<githash>
  Custom release 2:   1.6.0-2.avi.20160122.<githash>

The counter (0/1/2) ensures that the build number dominates over the date
when sorting.

Message-Id: <20170122102814.19649-1-avi@scylladb.com>
2017-01-22 14:56:52 +02:00
Avi Kivity
1be9c232b6 Merge seastar upstream
* seastar ff098c8...38aaa4a (1):
  > metrics: equal operator should use ==
2017-01-22 14:41:59 +02:00
Tomasz Grabiec
834df74df0 Merge batch statement optimization from github.com/avikivity/scylla/1689/v2
From Avi:

In many cases, batch statements are used to mutate a single partition, or
a number of partitions that is smaller than the number of statements within
the batch.  We can detect this case and reduce the numbers of mutations
applied, and in some cases, convert a logged batch into an unlogged batch.

Ref #1689.
2017-01-20 13:44:05 +01:00
Tomasz Grabiec
6c75614d19 sstables: Fix input_stream not being closed by index_reader
Fixes #2022
Message-Id: <1484912679-5729-1-git-send-email-tgrabiec@scylladb.com>
2017-01-20 11:58:33 +00:00
Paweł Dziepak
19ad35610b sstables: do not discard future returned by fast_forward_to()
continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().

With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>
2017-01-20 12:22:17 +01:00
Avi Kivity
a9403877e4 cql3: add more metrics for batch statements
- how many statements are in a batch
 - different types of batches
 - whether we were able to convert a logged batch to an unlogged batch
2017-01-20 13:19:00 +02:00
Avi Kivity
e3c003544d cql3: optimize batch_statement when the same partition is mutated multiple times
Batch statements are often used to insert multiple rows into the same
partition.  Recognize this case and merge mutations to the same partition.

If the result is a single mutation, there is an additional win (already
present in the code), where a logged batch can be converted into an unlogged
batch.
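
The merging step can be sketched roughly as below. mutation_sketch and
both function names are hypothetical, and rows are reduced to strings; the
real code merges full mutation objects.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified mutation: a partition key plus the rows written.
struct mutation_sketch {
    std::string partition_key;
    std::vector<std::string> rows;
};

// Merge all mutations that target the same partition into one.
std::vector<mutation_sketch> merge_by_partition(const std::vector<mutation_sketch>& batch) {
    std::map<std::string, mutation_sketch> merged;
    for (const auto& m : batch) {
        mutation_sketch& target = merged[m.partition_key];
        target.partition_key = m.partition_key;
        target.rows.insert(target.rows.end(), m.rows.begin(), m.rows.end());
    }
    std::vector<mutation_sketch> result;
    for (auto& entry : merged) {
        result.push_back(std::move(entry.second));
    }
    return result;
}

// If merging leaves a single mutation, a logged batch can be handled as an
// unlogged one.
bool can_convert_to_unlogged(const std::vector<mutation_sketch>& merged) {
    return merged.size() == 1;
}
```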

Ref #1689.
2017-01-20 13:18:56 +02:00
Benoît Canet
bcc826cc34 mutation_reader: Short circuit the read path on empty range
Add a boolean to short circuit the read path on empty range
hoping for some speedup.

tested in read write with cs using:

cl=QUORUM duration=1m -mode native cql3 -rate threads=700 -node localhost

Will do some additional benchmark.

Fixes #1056

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170118194451.16836-1-benoit@scylladb.com>
2017-01-20 10:05:40 +00:00
Avi Kivity
54b8acdd9f dht: add hashing and comparison helpers to dht::decorated_key
An std::hash specialization, and an equality comparator.
2017-01-20 11:24:14 +02:00
Avi Kivity
141048e0e5 dht: improve token hash function
For a small token, we can just return it, since it already is a hash.
We hash large tokens using murmur3, which is supposedly a good hash.
2017-01-20 11:24:14 +02:00
Raphael S. Carvalho
1857ba0abc db: fix bad resource usage distribution when resharding due to refresh
That's because a single shard is used to calculate generation for new
sstables in upload directory, and that will result in that single shard
sharing all the resources with other shards.
For refresh without upload dir, it currently works fine because we
reshuffle column family dir instead.

flush_upload_dir() is now a free function, takes a distributed database
object, and uses calculate_shard_from_sstable_generation() to decide
which shard will move sstable using its own generation namespace.

Fixes #2008.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>
2017-01-19 18:55:21 +02:00
Duarte Nunes
d53f96e0da column_family: Only update stats once for shared sstables
This patch ensures that when adding a shared sstable, we select only
one cpu to update that column family's stats. This is important so we
don't overestimate the on-disk size of sstables when resharding.

This fixes only a temporary miscount of the current load, since shared
sstables are eventually re-written, but it fixes a permanent miscount
of the total load.

Refs #1592

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170119144823.31041-1-duarte@scylladb.com>
2017-01-19 17:40:35 +02:00
Tomasz Grabiec
d61002cc33 lsa: Reduce reclamation latency
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

The change improves worst case latency. Reclamation time statistics
over 30 second period after cache fills up, in microseconds:

Before:

  avg = 1524.283148
  stdev = 11021.021118
  min = 12.934000
  max = 144356.000000
  sum = 257603.852000
  samples = 169

After:

  avg = 1317.362414
  stdev = 1913.542802
  min = 263.935000
  max = 19244.600000
  sum = 175209.201000
  samples = 133

Refs #1634.

Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
2017-01-19 17:35:36 +02:00
Amos Kong
b880bdccef dist/redhat: fix path of housekeeping.cfg
scylla-housekeeping[3857]: Config file /etc/scylla.d/housekeeping.cfg is missing, terminating

Housekeeping failed to run because the config file was missing;
the config file should be in /etc/scylla.d/.

Fixes #2020

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <e63f2f8cb94410a6dca4e6193932f0079755ad47.1484724328.git.amos@scylladb.com>
2017-01-19 11:08:46 +02:00
Avi Kivity
3c05a81ef9 Merge seastar upstream
* seastar 240b0bf...ff098c8 (15):
  > metrics::impl::shard(): check if reactor is initialized before using it
  > reactor: introduce engine_is_ready()
  > fix metric name
  > Merge "Add label support to the metric layer" from Amnon
  > core: Avoid memory leak when submission to syscall_work_queue fails
  > core: Avoid memory leak when submission to smp_message_queue fails
  > core: append_challenged_posix_file_impl: Make exception-safe
  > Merge "Log backtrace in report_failed_future" from Tomasz
  > install-dependencies.sh: add systemtap-sdt-dev to Ubuntu/Debian dependencies
  > core: add fsqual.cc/.hh to core
  > dpdk: Fix compile error with rte_pci.h
  > fstream_test: fix spurious failures due to BOOST_REQUIRE_EQUAL thread-unsafety
  > reactor: unregister metrics of queue on shard 0
  > build: track system header changes too
  > Prometheus: do not rely on collectd for the hostname
2017-01-19 11:00:12 +02:00
Tomasz Grabiec
dd0fb48564 sstables: Close _file even if random_access_reader::close() reports errors
close() operation is like a destructor, it cannot fail. It just
reports errors, but close itself succeeds. So we should proceed with
the closing even if it fails.
Message-Id: <1484245886-7269-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 12:41:55 +00:00
Tomasz Grabiec
d048eec254 row_cache: Fix stats handling for uncached wide partitions
Report hitting wide partition dummy as a cache miss instead of a hit.

Refs #2011
Message-Id: <1484302266-3828-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 09:58:04 +00:00
Tomasz Grabiec
87f15624f4 row_cache: Add counter for wide partition mispopulations
Message-Id: <1484733250-14470-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 09:57:51 +00:00
Calle Wilund
5da92db432 cell_comparator: Better fix (i.e. potentially correct) for compound/clustered desc.
As Tomek pointed out, previous code, regardless of version mismatch, of generating
comparator description string was not correct (as in: in sync with origin).
This modifies it to look at
1.) Actual clustering size
2.) Compound-ness
3.) Dense-ness

to determine whether we should generate a compound desc, and whether it
should contain a trailing utf8-desc type.

v2: Simplify non-dense base column addition and ensure it handles
    thrift non-utf8 (as per comments from tomek)
Message-Id: <1484670171-18362-1-git-send-email-calle@scylladb.com>
2017-01-17 18:03:11 +01:00
Amnon Heiman
e19fa02a17 remove scollectd from headers
As the metrics migration progressed, some includes of scollectd.hh were
left behind.

Because of the nature of the scollectd implementation, those includes
bring a lot of code with them into the header files and eventually into
many source files.

This patch removes those includes and adds a missing include to
storage_proxy.cc.

That the compiler didn't complain about the missing include indicates how
problematic those includes were in the first place.

Before this patch, a change in metrics.hh would cause 169 files to
recompile; after this change, 17.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>
2017-01-17 17:39:47 +02:00
Calle Wilund
7d2a4defcf schema: Fix version check for comparator desc string formatting
Fixes #2019

According to the Java driver and cassandra, all versions < 3
include the PK in the comparator descriptor string.

This broke for us when bumping the cassandra version 2.1 -> 2.2

Message-Id: <1484657580-14411-1-git-send-email-calle@scylladb.com>
2017-01-17 14:59:47 +02:00
Tomasz Grabiec
ddfee57c97 Replace iostream include with iosfwd in headers
Message-Id: <1484656119-8386-4-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:44 +02:00
Tomasz Grabiec
50e3e3af08 db: Add missing include
Message-Id: <1484656119-8386-3-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:44 +02:00
Tomasz Grabiec
ea9ab36ad5 db: Move operator<<() definition to .cc
Message-Id: <1484656119-8386-2-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:43 +02:00
Duarte Nunes
c8cbfb7919 storage_service: Make MV feature experimental
This patch ensures that the host only announces and registers the
MATERIALIZED_VIEWS feature if it was started with the experimental
flag.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170116123412.21365-1-duarte@scylladb.com>
2017-01-16 15:45:25 +02:00
Tomasz Grabiec
a559a7ae19 streamed_mutation: Fix memory corruption when reader constructor throws
After we call unlink_leftmost_without_rebalance(), we must unlink all
elements before the mutation is destroyed. We did this properly from
~reader, but it would not be called if reader construction failed,
which it may.
Message-Id: <1484572581-6537-1-git-send-email-tgrabiec@scylladb.com>
2017-01-16 13:26:30 +00:00
Paweł Dziepak
e03868c226 tests: run with all features enabled
Since ce083308a1
"random_mutation_generator: Generate RTs by default" random mutation
generator produces range tombstones. However, so far the tests were run
with all features disabled (because of incomplete initialization of all
services) which meant that RANGE_TOMBSTONE feature was not enabled and
the code couldn't handle range tombstones that weren't just prefixes.

This patch solves the problem by forcing all features to be enabled when
tests are run.
Message-Id: <20170116103324.22956-1-pdziepak@scylladb.com>
2017-01-16 11:38:45 +01:00
Tomasz Grabiec
3c3a4358ae storage_proxy: Fix capturing of on-stack variable by reference
partition_range_count was accepted by do_with callback by value and
then captured by reference by async code, thus invoking use after
destroy.

Message-Id: <1484317846-14485-1-git-send-email-tgrabiec@scylladb.com>
2017-01-16 11:49:11 +02:00
Avi Kivity
c314047b6c config: disable new sharding algorithm
It still has problems:
 - while resharding a very large leveled compaction strategy table, a huge
   number of tiny sstables is generated, overwhelming the file descriptor
   limits
 - there is a large impact on read latency while resharding is going on

(cherry picked from commit cf27d44412)

(forward-ported from branch-1.6)
2017-01-15 10:48:53 +02:00
Tomasz Grabiec
66547e7d7c storage_proxy: Add missing initialization of _short_read_allowed
Dropped by a1cafed370 ("storage_proxy:
handle range scans of sparsely populated tables").

Fixes the failure in update_cluster_layout_tests.TestUpdateClusterLayout test.

Message-Id: <1484317450-13525-1-git-send-email-tgrabiec@scylladb.com>
2017-01-13 16:47:54 +02:00
Takuya ASADA
bee7f549a9 scylla-housekeeping: move uuid file to /var/lib/scylla-housekeeping
Since scylla-housekeeping runs as the scylla user, it doesn't have permission
to create a file in /etc/scylla.d.
So introduce /var/lib/scylla-housekeeping, which is owned by the scylla user,
and place the uuid file in that directory.

Fixes #2009

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484235946-12463-1-git-send-email-syuu@scylladb.com>
2017-01-13 16:27:53 +02:00
Avi Kivity
c227e3e706 Merge "move a few files in the ScyllaDB project to use the new metrics registration API" from Vlad
* 'rearrange-scylla-collectd-stats-registration-v3' of github.com:cloudius-systems/seastar-dev:
  thrift::server: move collectd counters registration to the metrics registration layer
  gms::gossiper: move collectd counters registration to the metrics registration layer
  utils::logalloc: move collectd counters registration to metrics registration layer
  streaming::stream_manager: move a collectd counters registration to the metrics registration layer
  db::commitlog::commitlog: move collectd counters registration to the metrics registration layer
  sstables::compaction_manager: move collectd metrics registration to the metrics registration layer
  db::batchlog_manager: move collectd registration to the metrics registration layer
  transport::server: move collectd metrics registration to the metrics registration layer
  cql3::query_processor: move collectd metrics registration to the metrics registration layer
  database: move collectd registrations to metrics registration layer
  tracing::trace_keyspace_helper: move collectd metrics registration to a metric registration layer
  tracing::trace_keyspace_helper: fix alignment
  tracing::tracing: move collectd metrics registration to metrics registration layer
2017-01-12 17:13:08 +02:00
Tomasz Grabiec
1e8151b4f2 storage_proxy: Fix use-after-free on one_or_two_partition_ranges
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until
operation resolves. However, we were passing expiring value to it, the
result of unwrap().

Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test

Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.

Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
2017-01-12 15:10:51 +02:00
Takuya ASADA
c07d703d0d dist/redhat/scylla.spec.in: fix typo of scylla_cpuscaling_setup
Fix packaging error

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484191955-28006-2-git-send-email-syuu@scylladb.com>
2017-01-12 12:13:33 +02:00
Takuya ASADA
0e6df2a82e dist: follow DPDK script renaming
On DPDK 16.11 dpdk_nic_bind.py is renamed to dpdk-devbind.py, so we were
getting "file not found" both in packaging and in scripts; fix that.

Also fix inconsistent packaging.
Since Seastar copied dpdk_nic_bind.py to its scripts/ directory, there are two
different versions of the script, and the .rpm and .deb packages each ship a
different one:
 dist/redhat: seastar/dpdk/tools/dpdk_nic_bind.py
 dist/ubuntu: seastar/scripts/dpdk_nic_bind.py

That won't work because we share setup scripts between the two
distributions, so I changed the dist/ubuntu package to use the DPDK one.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484191955-28006-1-git-send-email-syuu@scylladb.com>
2017-01-12 12:13:33 +02:00
Gleb Natapov
76aed548e3 storage_proxy: add replica side counters for data read
Message-Id: <20170112085907.GN11469@scylladb.com>
2017-01-12 11:41:04 +02:00
Vlad Zolotarov
ca0a0f1458 tracing::trace_keyspace_helper: use generate_legacy_id() for CF IDs generation
Explicitly generate the IDs of tables from the system_traces KS using
generate_legacy_id() in order to ensure all nodes create these tables with
the same IDs.

This is going to prevent hitting issue #420.

Fixes #1976

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1484153725-31030-1-git-send-email-vladz@scylladb.com>
2017-01-12 11:36:35 +02:00
Tomasz Grabiec
33e1f9af6b sstables: Close input_stream from random_access_reader
Spotted by destroy-without-close detector.
Message-Id: <1484072527-13058-1-git-send-email-tgrabiec@scylladb.com>
2017-01-11 09:40:00 +00:00
Duarte Nunes
ce083308a1 random_mutation_generator: Generate RTs by default
This patch changes the random_mutation_generator so it generates range
tombstones by default. This fixes a bug where reversibly applying
range tombstones wasn't being tested.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170110164822.28747-1-duarte@scylladb.com>
2017-01-11 09:24:37 +00:00
Vlad Zolotarov
7fb0bab7d7 thrift::server: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Vlad Zolotarov
eb4fbb3949 gms::gossiper: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Vlad Zolotarov
022bca16bf utils::logalloc: move collectd counters registration to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Vlad Zolotarov
a850bea820 streaming::stream_manager: move a collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
dcdd98ccc1 db::commitlog::commitlog: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
00e37c389b sstables::compaction_manager: move collectd metrics registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
a9f6e5f8da db::batchlog_manager: move collectd registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
3b41d589f8 transport::server: move collectd metrics registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
8d0a2e3883 cql3::query_processor: move collectd metrics registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
cda382e8d6 database: move collectd registrations to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
af29c3506b tracing::trace_keyspace_helper: move collectd metrics registration to a metric registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
0df37c04f6 tracing::trace_keyspace_helper: fix alignment
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
6267bb63f4 tracing::tracing: move collectd metrics registration to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Avi Kivity
1ff0eef0a8 intrusive_set_external_comparator: avoid using boost::intrusive::value_traits_pointers
boost::intrusive::value_traits_pointers was introduced in boost 1.56, while
we also support boost 1.55.  Replace with an equivalent expression.

(with additions by Asias)

Message-Id: <20170110084700.19994-1-avi@scylladb.com>
2017-01-10 18:16:56 +02:00
Pekka Enberg
3d0217ec43 db/schema_tables: Fix system keyspace table list
Commit f0c28e1 ("db/schema_tables: Add schema_functions and
schema_aggregates tables") forgot to add the newly added tables to the
db::schema_tables::ALL list, which is used for authorization checks, for
example.

Fixes the following auth_test.py dtest failures:

  ('Unable to connect to any servers', {'127.0.0.1': Unauthorized('Error from server: code=2100 [Unauthorized] message="User cathy has no SELECT permission on <table system.schema_functions> or any of its parents"',)})
Message-Id: <1484045277-4997-1-git-send-email-penberg@scylladb.com>
2017-01-10 13:55:04 +01:00
Avi Kivity
0591303b72 Merge "avoid excessive memory usage during resharding" from Raphael
"Intended to reduce memory usage when resharding by sharing sstable
components among shards. File descriptors are also shared from now
on, meaning that a much smaller number of file descriptors will be
used during resharding.

Fixes #1951."

branch 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla

* 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla:
  db: avoid excessive memory usage during resharding
  checked_file_impl: add support to dup
  sstables: group sstable components that can be shared among shards
  sstables: rename sstable member
2017-01-09 20:43:50 +02:00
Raphael S. Carvalho
68dfcf5256 db: avoid excessive memory usage during resharding
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to number of shards. That can easily
lead to OOM.

SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.

For this approach to work, we now only populate keyspaces from
shard 0, which is now solely responsible for iterating through
column family dirs. In addition, most of the population functions
are now free functions that take the distributed database object as a
parameter.

Fixes #1951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-09 15:24:36 -02:00
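The shard-selection rule from the commit above (68dfcf5256) can be sketched as follows. This is a minimal illustration and not the Scylla code: `owner_shard` and `next_generation` are hypothetical names, `shard_count` stands in for smp::count, and the allocation rule is only an assumption about what "the inverse" of (generation % smp::count) looks like.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the shard that opens an sstable and shares its
// immutable components with the other shards, i.e. generation % smp::count.
inline unsigned owner_shard(int64_t generation, unsigned shard_count) {
    assert(shard_count > 0 && generation >= 0);
    return static_cast<unsigned>(generation % shard_count);
}

// Assumed allocation rule: pick the smallest generation above `highest_seen`
// rounded up to the next multiple of shard_count, plus the shard number, so
// that owner_shard() maps it back to the shard that created it.
inline int64_t next_generation(int64_t highest_seen, unsigned shard,
                               unsigned shard_count) {
    assert(shard_count > 0 && shard < shard_count && highest_seen >= 0);
    int64_t base = (highest_seen / shard_count + 1) * shard_count;
    return base + shard;
}
```

With such a pairing, an sstable written by a shard is also opened by that shard, which is why nothing needs to be shared unless resharding happened.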
Raphael S. Carvalho
9200e389c2 checked_file_impl: add support to dup
That's needed for sstable fd sharing to work.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-09 13:33:30 -02:00
Avi Kivity
77cb2b452f Merge "CQL 3.3.1 support" from Pekka
"This patch series adds support for CQL 3.3.1. The changes to CQL are listed
here:

  https://github.com/apache/cassandra/blob/cassandra-2.2/doc/cql3/CQL.textile#changes

The following CQL features are already supported by Scylla:

  - TRUNCATE TABLE alias
  - Double-dollar string literals
  - Aggregate functions: MIN, MAX, SUM, and AVG

This series adds the following CQL features:

  - New data types: tinyint, smallint, date, and time
  - CQL binary protocol v4 (required by the new data types)
  - Advertise Cassandra 2.2.8 version from Scylla so that drivers correctly
    detect the presence of CQL 3.3.1

The following CQL features are not supported by Scylla:

  - Role-based access control (issue #1941)
  - JSON data type
  - User-defined functions (UDFs)
  - User-defined aggregates (UDAs)

The following CQL binary protocol v4 changes are not implemented by this
series:

  - Read_failure and Write_failure error codes are not implemented.
    These error codes are not used by the smart drivers, but as they are
    propagated to application code, we eventually need to wire them up
    to our storage proxy implementation.
  - Function_failure error code is only used by user-defined functions
    and the fromJson function, which are not implemented by Scylla.

Fixes #1284."

* 'penberg/cql-3.3.1/v5' of github.com:cloudius-systems/seastar-dev:
  version: Bump Cassandra version to 2.2.8
  db/schema_tables: Add schema_functions and schema_aggregates tables
  tests/type_tests: TIME type test cases
  tests/cql_query_test: TIME type test cases
  cql3: TIME data type support
  tests/type_tests: DATE type test cases
  tests/cql_query_test: DATE type test cases
  cql3: DATE type support
  date.h: 64-bit year and days representation
  licenses: Add utils/date.h license
  utils/date.h: Import date and time library sources
  tests/type_tests: TINYINT and SMALLINT type test cases
  tests/cql_query_test: TINYINT and SMALLINT type test cases
  cql3: TINYINT and SMALLINT data type support
  types: Fix integer_type_impl::parse_int() for bytes
2017-01-09 11:54:45 +02:00
Avi Kivity
8f36dca6f1 storage_proxy: prevent short read due to buffer size limit from being swallowed during range scan
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.

Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.

Fixes #2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>
2017-01-09 09:21:43 +00:00
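The change in the commit above (8f36dca6f1) amounts to seeding the merged flag differently. A minimal sketch, assuming a hypothetical `partial_result_info` type and a boolean `accounter_short_read` that stands in for what `this->short_read()` reports; the real merger works on query results, not bare flags.

```cpp
#include <vector>

// Hypothetical stand-in for one partial result of a range scan.
struct partial_result_info {
    bool short_read;
};

// The merged result is a short read if any partial result is one, or if the
// memory accounter stopped the scan early. Before the fix the seed value was
// effectively `false`, so an early stop with only full partials was lost.
inline bool merged_short_read(const std::vector<partial_result_info>& partials,
                              bool accounter_short_read) {
    bool short_read = accounter_short_read;
    for (const auto& p : partials) {
        short_read = short_read || p.short_read;
    }
    return short_read;
}
```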
Pekka Enberg
856d0e40fb version: Bump Cassandra version to 2.2.8
Advertise Cassandra 2.2.8 version to the drivers: CQL 3.3.1 language
version and CQL binary protocol version 4 support.
2017-01-09 10:42:21 +02:00
Pekka Enberg
f0c28e1b2d db/schema_tables: Add schema_functions and schema_aggregates tables
The 3.0.3 Java driver, for example, searches for these tables and fails when
we advertise Cassandra 2.2 version from Scylla.
2017-01-09 10:42:21 +02:00
Pekka Enberg
10facd7db8 tests/type_tests: TIME type test cases 2017-01-09 10:42:21 +02:00
Pekka Enberg
a49ee9387e tests/cql_query_test: TIME type test cases 2017-01-09 10:42:20 +02:00
Pekka Enberg
93e6592296 cql3: TIME data type support
This adds support for the TIME data type introduced in CQL 3.3.1.

Refs #1284
2017-01-09 10:42:20 +02:00
Pekka Enberg
9ceea7bbc4 tests/type_tests: DATE type test cases 2017-01-09 10:42:20 +02:00
Pekka Enberg
f0cbfb9e4f tests/cql_query_test: DATE type test cases 2017-01-09 10:42:20 +02:00
Pekka Enberg
9def7db381 cql3: DATE type support
This adds support for the DATE type introduced in CQL 3.3.1.

Refs #1284
2017-01-09 10:42:20 +02:00
Pekka Enberg
f83503c09e date.h: 64-bit year and days representation
We need 64-bit year and days representation to support the boundary
values of the CQL data type, which is implemented using Joda Time
library's DateTime type.
2017-01-09 10:42:20 +02:00
Pekka Enberg
41df14f62d licenses: Add utils/date.h license 2017-01-09 10:42:20 +02:00
Pekka Enberg
7f2fc6470c utils/date.h: Import date and time library sources
This patch imports the "date.h" date and time library based on the C++11
<chrono> header, which is proposed for standardization:

  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0355r1.html

We need it to implement support for the CQL date type.

Import repository

  https://github.com/HowardHinnant/date

Import commit:

  commit 2935f80109b8cfc15eb1243afe35f7ec3530f971
  Author: Howard Hinnant <howard.hinnant@gmail.com>
  Date:   Sun Jan 1 15:02:08 2017 -0500

      Have get_version check for the file named version first
2017-01-09 10:39:54 +02:00
Takuya ASADA
42c1e1e0e8 dist/common/systemd: run node-exporter.service as scylla user
For security reasons, we should run node-exporter.service as the scylla user,
instead of root.

Fixes #1968

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483543419-16541-1-git-send-email-syuu@scylladb.com>
2017-01-09 09:51:47 +02:00
Paweł Dziepak
3339cced05 sstables: file_writer: make write() non-virtual
No one overrides file_writer::write(), so there is no reason to inhibit
optimisations and cause the compiler to emit indirect calls.

Message-Id: <20170104163618.26251-1-pdziepak@scylladb.com>
2017-01-09 09:47:37 +02:00
Takuya ASADA
5422a8e046 dist/ubuntu: generate Ubuntu/Debian revision correctly
The Ubuntu Packaging Guide says that if there's no upstream package (meaning
it's not ported from Debian), the revision should be "0ubuntu1", not
"ubuntu1", which is what we are currently using.

On Debian, the Debian Policy Manual says it's conventional to restart the
revision from 1 when the upstream version is increased, so we should set it
to "1".

To do this in a single script, we generate the revision at build time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483498658-27491-1-git-send-email-syuu@scylladb.com>
2017-01-09 09:45:46 +02:00
Takuya ASADA
920683a882 dist/common/scripts: add scylla_cpuscaling_setup
To set the CPU scaling governor to 'performance', add a new script and call
it from scylla_setup.

Fixes #1895

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483542216-12195-1-git-send-email-syuu@scylladb.com>
2017-01-09 09:44:41 +02:00
Avi Kivity
97ab0d9feb build: track system header changes too
Changes to system headers, such as the boost headers, should trigger a rebuild.
2017-01-08 20:49:19 +02:00
Avi Kivity
85f4e16336 main: fix incorrect low memory warning
A spurious division by smp::count warns that memory is low even when plenty
is available.  Fix by removing the division.

Fix #2002.

Message-Id: <20170108122216.27233-1-avi@scylladb.com>
Tested-by: Benoît Canet <benoit@scylladb.com>
2017-01-08 15:14:36 +02:00
Amnon Heiman
8cd3d7445c scylla_setup: remove the uuid file creation
Scylla housekeeping can create a uuid file if it is missing. There is no
longer a need to create one for it.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-3-git-send-email-amnon@scylladb.com>
2017-01-08 14:11:04 +02:00
Amnon Heiman
32888fc0aa scylla-housekeeping: Create a uuid file if one is missing
This patch gets housekeeping to create a uuid file if a path to a uuid
file is supplied but the file is missing.

Because it imports the uuid lib, the uuid parameters were renamed.

Fixes #1987

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-2-git-send-email-amnon@scylladb.com>
2017-01-08 14:11:03 +02:00
Gleb Natapov
9ed3346f98 main: fix error reporting about low memory
Message-Id: <20170108112144.GT1829@scylladb.com>
2017-01-08 13:46:48 +02:00
Raphael S. Carvalho
eed2a7d065 sstables: group sstable components that can be shared among shards
We intend to share immutable sstable components among shards to
reduce excessive memory usage when resharding shared sstables.

This change is about grouping those components into a structure,
and using foreign ptr to make sure that the structure will be
deleted by whichever shard created it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-06 15:16:19 -02:00
Raphael S. Carvalho
a492f8dfaf sstables: rename sstable member
Rename _components to _recognized_components because _components
will be used to name a field with shareable components.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-06 15:16:17 -02:00
Avi Kivity
38b2fa27ad Merge seastar upstream
* seastar 1c8e389...240b0bf (15):
  > file/dup: don't decrease refcnt twice when file is explicitly closed
  > reactor: Add missing CentOS 7.2 dependency systemtap-sdt-devel
  > reactor: Cleaning the smp queue metrics when shuting down
  > metrics: metrics keep the value map while unregistering
  > change the reactor load metrics to utilization
  > Merge "ASan fiber switches" from Paweł
  > tls: Add missing credentials_builder::set_client_auth method
  > collectd: create metrics with the right format
  > io_queue: remove owner number from metric name
  > reactor: change the load metric name to load
  > Merge "reactor: stop using signals for task_quota timer"
  > metrics: Allow initializing the metric_group in its constructor
  > Update DPDK to 16.11
  > Revert "rpc: Avoid using zero-copy interface of output_stream"
  > core::metrics_groups: add a clear() method
2017-01-06 16:34:51 +02:00
Vlad Zolotarov
492295eb7f init: move supervisor_notify() out of main.cc
Transform the supervisor_notify() and related functions into
the "supervisor" class and place this class implementation in
a separate .cc file.

This fixes the test compilation breakage introduced by

commit 8014adc2a1

    init: serialize the creation of system_traces KS objects

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1483663955-20096-1-git-send-email-vladz@scylladb.com>
2017-01-06 10:10:55 +00:00
Avi Kivity
be11b054e1 Merge "Reduce the size of mutation_partition" from Piotr
"Reduce the size of mutation_partition by implementing intrusive set using
bi::rbtree_algorithms directly and using tree nodes optimized for size.

This will reduce the size of mutation_partition by:
24 bytes + <number of cql rows> * 8 bytes

This should have a positive impact on performance because mutation_partitions
are stored both in memtable and cache.

Fixes #742."

* 'haaawk/742' of github.com:cloudius-systems/seastar-dev:
  intrusive_set: rename size() to calculate_size()
  Make intrusive_set_external_comparator::_value_traits static
  Implement intrusive set using rbtree_algorithms
  mutation_partition: make apply_reversibly_intrusive_set nongeneric
  mutation_partition: take schema in find_row and clustered_row
  mutation_partition: Extract intrusive set logic to a class.
  mutation_partition: Replace value_comp with key_comp calls
2017-01-05 17:34:10 +02:00
Tomasz Grabiec
cd630fece6 db: Make system tables use the commitlog
Before this patch, system table writes did not go to the commit log
because database::add_column_family() disables writes to the commit log
for the table being added if _commitlog is not set at that
time. Fix by initializing the commit log before system tables are created.

Fixes #1986.

Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.

This could delay system keyspace writes for longer than before
under a heavy write workload. Refs #1926.

Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
2017-01-05 14:53:51 +02:00
Avi Kivity
eb520e7352 storage_proxy: fix result ordering for parallel partition range scans
During a range scan, we try to avoid sorting according to partition range
when we can do so.  This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.

However, we use the wrong key for the sort -- we use the shard number.  But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.

Fix by storing the iteration order as the sort key.  We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.

Fixes #1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>
2017-01-05 12:51:37 +01:00
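The wrap-around problem fixed in the commit above (eb520e7352) can be reproduced with a small sketch. All names here (`partial_result`, `scan`, `sort_by_shard`, `sort_by_order`) are hypothetical, `shard_count` stands in for smp::count, and `first_token` is a stand-in for the range each shard returns; the real code sorts query results rather than integers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct partial_result {
    unsigned shard;   // shard that produced this chunk
    unsigned order;   // position in iteration order (the sort key after the fix)
    int first_token;  // stand-in for the chunk's range; grows with iteration order
};

// Visit `n` consecutive shards starting at `start`, wrapping around to shard 0.
inline std::vector<partial_result> scan(unsigned start, unsigned n,
                                        unsigned shard_count) {
    assert(shard_count > 0 && n <= shard_count);
    std::vector<partial_result> out;
    for (unsigned i = 0; i < n; ++i) {
        out.push_back({(start + i) % shard_count, i, static_cast<int>(i) * 100});
    }
    return out;
}

// Old behaviour: sorting by shard number misorders a scan that wrapped around.
inline void sort_by_shard(std::vector<partial_result>& v) {
    std::sort(v.begin(), v.end(),
              [](const partial_result& a, const partial_result& b) {
                  return a.shard < b.shard;
              });
}

// Fixed behaviour: sorting by iteration order keeps the ranges in sequence.
inline void sort_by_order(std::vector<partial_result>& v) {
    std::sort(v.begin(), v.end(),
              [](const partial_result& a, const partial_result& b) {
                  return a.order < b.order;
              });
}
```

For a scan of two shards starting at shard 3 of 4, the visited shards are 3 and then 0; sorting by shard puts shard 0's later range first, while sorting by iteration order preserves the range order.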
Vlad Zolotarov
8014adc2a1 init: serialize the creation of system_traces KS objects
Serialize the creation of the system_traces KS objects when
they do not exist - the initial cluster boot.
Avoid creating them in parallel by different cluster Nodes
in order to avoid issue #420.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1483552503-12873-3-git-send-email-vladz@scylladb.com>
2017-01-05 12:41:38 +01:00
Vlad Zolotarov
d3b8b67e66 service::storage_service: serialize the system_auth KS initialization
Move the system_auth KS initialization to be before Node moves to the NORMAL
state. This way we will serialize this code running on different Nodes and
avoid hitting issue #420.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1483552503-12873-2-git-send-email-vladz@scylladb.com>
2017-01-05 12:36:06 +01:00
Piotr Jastrzebski
b159e08764 intrusive_set: rename size() to calculate_size()
This hopefully will make it more apparent that
the time complexity of this method is O(N) not O(1).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 12:21:43 +01:00
Piotr Jastrzebski
b47a296053 Make intrusive_set_external_comparator::_value_traits static
_value_traits can be shared among all instances
and there's no need to store it in every single one.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 12:21:10 +01:00
Avi Kivity
4667641f5f result_memory_tracker: fix too-short short reads
1.6 truncates paged queries early to avoid overrunning server memory
with too-large query results, but in the case of partition range queries,
this terminates too early due to an uninitialized variable holding the
maximum result size.  This results in slow performance due to additional
round trips.

Fix by initializing the maximum result size from the result_memory_tracker
running on the coordinating shard.

Fixes #1995.
Message-Id: <20170105103915.10633-1-avi@scylladb.com>
2017-01-05 10:51:55 +00:00
Piotr Jastrzebski
041b0a65ac Implement intrusive set using rbtree_algorithms
This new implementation takes less memory because it
does not store comparator.

It also uses tree nodes optimized for size. This means
that instead of storing an enum field |color| they embed
this information inside pointer to parent.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:46:58 +01:00
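The size optimisation mentioned in the commit above (041b0a65ac), keeping the colour inside the parent pointer, can be illustrated with a minimal sketch. `compact_node` is a hypothetical type written for this illustration; it is not the boost::intrusive node and it assumes nodes are at least 2-byte aligned, so bit 0 of any node address is zero.

```cpp
#include <cassert>
#include <cstdint>

enum class color : std::uintptr_t { red = 0, black = 1 };

// Hypothetical size-optimised tree node: the low bit of the parent pointer
// holds the colour, so no separate colour field is needed.
struct compact_node {
    std::uintptr_t parent_and_color = 0;
    compact_node* left = nullptr;
    compact_node* right = nullptr;

    compact_node* parent() const {
        return reinterpret_cast<compact_node*>(parent_and_color &
                                               ~std::uintptr_t(1));
    }
    color get_color() const {
        return static_cast<color>(parent_and_color & 1);
    }
    void set_parent(compact_node* p) {
        auto bits = reinterpret_cast<std::uintptr_t>(p);
        assert((bits & 1) == 0);  // relies on node alignment
        parent_and_color = bits | (parent_and_color & 1);
    }
    void set_color(color c) {
        parent_and_color = (parent_and_color & ~std::uintptr_t(1)) |
                           static_cast<std::uintptr_t>(c);
    }
};
```

On common 64-bit platforms this node is three pointers wide, whereas a node with a separate colour field would typically be padded to four.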
Piotr Jastrzebski
a0c20f5c49 mutation_partition: make apply_reversibly_intrusive_set nongeneric
apply_reversibly_intrusive_set is used only in one place
and always with rows_type. There's no need for it to be generic.
This will allow changing intrusive set implementation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
4bbe05dd47 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
fe3c91db90 mutation_partition: Extract intrusive set logic to a class.
It will make it easier to change the implementation
of the intrusive set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
da67ac7ae4 mutation_partition: Replace value_comp with key_comp calls
This will reduce the size of bi::set API being used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Pekka Enberg
0ea5652354 tests/type_tests: TINYINT and SMALLINT type test cases 2017-01-05 10:57:35 +02:00
Pekka Enberg
41e3327ebc tests/cql_query_test: TINYINT and SMALLINT type test cases 2017-01-05 10:57:35 +02:00
Pekka Enberg
fcaa743e3d cql3: TINYINT and SMALLINT data type support
This adds support for the TINYINT and SMALLINT data types introduced in
CQL 3.3.1.

Refs #1284
2017-01-05 10:57:35 +02:00
Pekka Enberg
257fa541f1 types: Fix integer_type_impl::parse_int() for bytes
The integer_type_impl::parse_int() function uses boost::lexical_cast()
under the hood, which parses 8-bit numbers as characters. Fix the
function to lexical cast to 64-bit integer and convert the result to
integer_type_impl template type.
2017-01-05 10:57:35 +02:00
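The pitfall described in the commit above (257fa541f1) and the shape of the fix can be sketched with the standard library alone. This is an illustration under assumptions, not the patched function: it uses `std::istringstream` and `std::stoll` in place of boost::lexical_cast, `naive_parse_int8` and `parse_int` are hypothetical free functions, `parse_int` is meant for signed integer types only, and the character-versus-number behaviour assumes a platform where `int8_t` is `signed char` with ASCII encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <sstream>
#include <stdexcept>
#include <string>

// Stream extraction into an 8-bit integer reads one character, not a number.
inline int naive_parse_int8(const std::string& s) {
    assert(!s.empty());
    std::istringstream in(s);
    int8_t v = 0;
    in >> v;  // for "7" this stores the character '7', not the value 7
    return v;
}

// Parse as a wide integer first, then range-check and narrow to T.
template <typename T>
T parse_int(const std::string& s) {
    std::size_t pos = 0;
    long long wide = std::stoll(s, &pos);
    if (pos != s.size()) {
        throw std::invalid_argument("trailing characters: " + s);
    }
    if (wide < std::numeric_limits<T>::min() ||
        wide > std::numeric_limits<T>::max()) {
        throw std::out_of_range("value out of range: " + s);
    }
    return static_cast<T>(wide);
}
```

Parsing wide and narrowing afterwards also gives one place to reject out-of-range input such as "128" for an 8-bit type.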
Nadav Har'El
45f19f2633 main: better error message on failing to start Prometheus
Previously, if the Prometheus port (by default, 0.0.0.0:9180) could not
be opened, the following message appeared in the log about 10 seconds into
the run, and Scylla crashed.

ERROR 2017-01-01 19:31:04,066 [shard 0] seastar - Exiting on unhandled exception: std::system_error (error system:98, Address already in use)

The puzzled user would have no idea *which* address was already in use, why,
or why Scylla stopped.

In this patch, before the above message we get the much more informative
message:

ERROR 2017-01-01 19:58:19,080 [shard 0] init - Could not start Prometheus API server on 0.0.0.0:9180: std::system_error (error system:98, Address already in use)

We continue to print the original message - and exit - in this case,
under the assumption that it's better not to run the database while
improperly configured.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170102121304.2060-1-nyh@scylladb.com>
2017-01-04 14:58:26 +02:00
Tzach Livyatan
0c746b22e0 Fix a typo in scylla_setup housekeeping prompt
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <1483362474-22113-1-git-send-email-tzach@scylladb.com>
2017-01-04 14:54:22 +02:00
Takuya ASADA
43655512e1 dist/redhat: add python-setuptools on dependency since it requires for scylla-housekeeping
scylla-housekeeping breaks when python-setuptools isn't installed, so
add it as a dependency.

Fixes #1884

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483525828-7507-1-git-send-email-syuu@scylladb.com>
2017-01-04 14:32:10 +02:00
Pekka Enberg
060841b756 tests/types_test: Fix int32 type string conversion boundary case
The test case is interested in the upper boundary of 32-bit integer
because we already test the lower boundary in assertions below. The old
test passed, of course, but it wasn't very interesting.
Message-Id: <1483522773-6008-1-git-send-email-penberg@scylladb.com>
2017-01-04 11:57:02 +01:00
Avi Kivity
3232d47d4f dist: remove another bc dependency
No longer used.
2017-01-01 11:13:34 +02:00
Tzach Livyatan
2bfa7cc086 dist/common/scripts: improve scylla_setup wording
Fix a few minor typos and improve the user prompt text

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <1482918340-19375-1-git-send-email-tzach@scylladb.com>
2016-12-30 13:18:08 +02:00
Tzach Livyatan
436ce7ae49 conf/scylla.yaml: Move broadcast_rpc_address to the supported section
Fixes #1779

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <1483021417-8415-1-git-send-email-tzach@scylladb.com>
2016-12-29 16:24:56 +02:00
Takuya ASADA
e48cc9cf01 dist/ubuntu: check lsb_release existence since it's not included in a minimal Debian installation
Ubuntu has it in minimal installation but Debian doesn't, so add it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483003565-2753-1-git-send-email-syuu@scylladb.com>
2016-12-29 11:33:21 +02:00
Pekka Enberg
a443dfa95e tracing: Add seastar/core/scollectd.hh include
Fix the following build breakage:

FAILED: build/release/gen/cql3/CqlParser.o
g++ -MMD -MT build/release/gen/cql3/CqlParser.o -MF build/release/gen/cql3/CqlParser.o.d -std=gnu++1y -g  -Wall -Werror -fvisibility=hidden -pthread -I/home/penberg/scylla/seastar -I/home/penberg/scylla/seastar/fmt -I/home/penberg/scylla/seastar/build/release/gen  -march=nehalem -Ifmt -DBOOST_TEST_DYN_LINK -Wno-overloaded-virtual -DFMT_HEADER_ONLY -DHAVE_HWLOC -DHAVE_NUMA -DHAVE_LZ4_COMPRESS_DEFAULT  -O2 -DBOOST_TEST_DYN_LINK  -Wno-maybe-uninitialized -DHAVE_LIBSYSTEMD=1 -I. -I build/release/gen -I seastar -I seastar/build/release/gen -c -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
In file included from ./query-request.hh:31:0,
                 from ./locator/token_metadata.hh:51,
                 from ./locator/abstract_replication_strategy.hh:29,
                 from ./database.hh:26,
                 from ./service/storage_proxy.hh:44,
                 from ./db/schema_tables.hh:43,
                 from ./db/system_keyspace.hh:46,
                 from ./cql3/functions/function_name.hh:45,
                 from ./cql3/selection/selectable.hh:48,
                 from ./cql3/selection/writetime_or_ttl.hh:45,
                 from build/release/gen/cql3/CqlParser.hpp:63,
                 from build/release/gen/cql3/CqlParser.cpp:44:
./tracing/tracing.hh:357:5: error: ‘scollectd’ does not name a type
     scollectd::registrations _registrations;
     ^~~~~~~~~

Message-Id: <1482939751-8756-1-git-send-email-penberg@scylladb.com>
2016-12-28 18:40:18 +02:00
Nadav Har'El
d49aa7abd2 storage_service: make is_joined() an immediate function
Commit d41cd48a made the is_joined() method a future<bool> because
only cpu 0 knows its real value. This makes this function inconvenient
to use. So this patch reverts commit d41cd48a, and instead sets this
flag's value on all shards, so each shard can read its value locally
(and immediately).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161228160450.5831-1-nyh@scylladb.com>
2016-12-28 18:37:22 +02:00
Pekka Enberg
2aee7f6334 Merge seastar upstream
* seastar f32e4c2...1c8e389 (2):
  > Merge "migrate network related seastar collectd metrics to the new metrics registration API" from Vlad
  > file: add dup() support
2016-12-28 17:04:11 +02:00
Duarte Nunes
1444a52fae position_in_partition: Add tri_comparator
Will be needed to order view updates with the existing mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
[pdziepak: corrected component name in commit message]
Message-Id: <1482880989-3086-2-git-send-email-duarte@scylladb.com>
2016-12-28 13:04:16 +01:00
Duarte Nunes
c6b0387f31 clustering_bounds_comparator: Add tri_comparator
This patch adds a tri_comparator for bound_view, which will be used
to add a tri_comparator to position_in_partition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482880989-3086-1-git-send-email-duarte@scylladb.com>
2016-12-28 13:02:57 +01:00
Duarte Nunes
adb727f7dc clustering_row: Add apply() overload
This patch adds an overload to the apply() function,
which takes a clustering_row by reference, to copy. This will be
needed by future patches, when merging base table updates with the
existing data.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482881106-3202-1-git-send-email-duarte@scylladb.com>
2016-12-28 12:45:12 +01:00
Pekka Enberg
302035577e cql3/statements: Make batch_statement::_type private
The _type member variable is never accessed outside of the
batch_statement class so make it private.
Message-Id: <1482921073-28485-1-git-send-email-penberg@scylladb.com>
2016-12-28 12:08:05 +01:00
Pekka Enberg
20daf43403 cql3/statements: Move batch_statement implementation to source file
Clean up batch_statement class by moving implementation to the
batch_statement.cc source file to make it easier to modify the class.

Message-Id: <1482920872-28303-1-git-send-email-penberg@scylladb.com>
2016-12-28 12:30:03 +02:00
Duarte Nunes
86a109915d streamed_mutations: Update comments
This patch removes references to the old begin_range_tombstone and
end_range_tombstone mutation_fragments, which have been replaced by a
single range_tombstone fragment.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482880820-2831-1-git-send-email-duarte@scylladb.com>
2016-12-28 09:06:49 +01:00
Gleb Natapov
4ca58959ad storage_proxy: do not deref unengaged stdx::optional
Fixes intentional short reads.

Message-Id: <20161227142133.GE1829@scylladb.com>
2016-12-27 16:30:03 +02:00
Vlad Zolotarov
9606db2f08 api::set_tracing_probability: prevent a server from returning 500 for a bad probability value
- Change an exception type thrown by a tracing::tracing::set_trace_probability()
  to make it different from the one thrown by an std::stod() when it fails to
  parse a given string.
- Catch the std::out_of_range exception thrown by a tracing::tracing::set_trace_probability() and
  wrap the exception string into the httpd::bad_param_exception() object.
- Throw a httpd::bad_param_exception() with a
  "Bad format in a probability value: <a user given probability string value>"
  message if std::invalid_argument is caught.
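
The error mapping above can be sketched as follows. This is an illustration
only: BadParam, ProbabilityOutOfRange and handle_set_probability() are
hypothetical stand-ins, not the real API. The point is that parse failures and
range failures use different exception types, and both end up as a "bad
parameter" error instead of an internal server error.

```python
class BadParam(Exception):
    """Hypothetical stand-in for an HTTP 'bad parameter' (400) error."""


class ProbabilityOutOfRange(Exception):
    """Hypothetical range error, deliberately distinct from parse errors."""


def set_trace_probability(p: float) -> float:
    # Reject values outside [0, 1]; NaN also fails this comparison.
    if not 0.0 <= p <= 1.0:
        raise ProbabilityOutOfRange(f"trace probability must be in [0, 1]: {p}")
    return p


def handle_set_probability(raw: str) -> float:
    """Parse the user-supplied string and translate both failure kinds."""
    try:
        value = float(raw)  # raises ValueError on malformed input
    except ValueError:
        raise BadParam(f"Bad format in a probability value: {raw}") from None
    try:
        return set_trace_probability(value)
    except ProbabilityOutOfRange as exc:
        raise BadParam(str(exc)) from None
```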

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465300738-1557-1-git-send-email-vladz@cloudius-systems.com>
2016-12-27 12:07:09 +02:00
Avi Kivity
339cc0c2fa main: verify sufficient memory per shard
Refuse to boot if we don't have at least 1 GiB per shard, unless in developer
mode.

The primary violator here is docker, but since it starts in developer mode,
it won't get fixed.  We need some extra logic for this case.
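
A minimal sketch of such a boot-time check, assuming memory is split evenly
across shards. The function name and error text are hypothetical; only the
1 GiB threshold and the developer-mode exemption come from the message above.

```python
GIB = 1 << 30
MIN_MEMORY_PER_SHARD = 1 * GIB  # threshold stated in the commit message


def verify_memory_per_shard(total_memory: int, shards: int, developer_mode: bool) -> int:
    """Return the per-shard memory, refusing to continue if it is too small."""
    per_shard = total_memory // shards
    if per_shard < MIN_MEMORY_PER_SHARD and not developer_mode:
        raise RuntimeError(
            f"Only {per_shard} bytes per shard; need at least {MIN_MEMORY_PER_SHARD}"
        )
    return per_shard
```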
Message-Id: <20161221090222.28677-1-avi@scylladb.com>
2016-12-27 12:05:52 +02:00
Avi Kivity
868b4d110c Merge "Fixes for intentional short reads" from Paweł
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.

In order to minimise the chances of a digest mismatch during data queries, replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they stop at the same point as
nodes doing full data queries. Moreover, data queries are not
affected by the per-shard memory limit, and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.

It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."

* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
  query: introduce result_memory_accounter::foreign_state
  storage_proxy: fix short reads in parallel range queries
  storage_proxy: pass maximum result size to replicas
  mutation_partition: use result limiter for digest reads
  query: make result_memory_limiter constants available for linker
  result_memory_limiter: add accounter for digest reads
  idl: allow writers to use any output stream
  result_memory_limiter: split new_read() to new_{data, mutation}_read()
  idl: is_short_read() was added in 1.6
  mutation_partition: honour allowed_short_read for static rows
  storage_proxy: fix _is_short_read computation
  storage_proxy: disallow short reads if got no live rows
  storage_proxy: don't stop after result with no live rows
2016-12-26 10:42:49 +02:00
Avi Kivity
1d9ee358f1 Revert "Merge "Reduce the size of mutation_partition" from Piotr"
This reverts commit aa392810ff, reversing
changes made to a24ff47c637e6a5fd158099b8a65f1191fc2d023; it uses
boost::intrusive::detail directly, which it must not, and doesn't compile on
all boost versions as a consequence.
2016-12-25 16:07:48 +02:00
Avi Kivity
59d389bd46 Merge seastar upstream
* seastar 0b98024...f32e4c2 (11):
  > Merge "Moving the reactor counters to the metric layer" from Amnon
  > metrics: Metrics function should take variable as a refernce
  > Revert "Merge ""Moving the reactor counters to the metric layer from Amnon"
  > Merge ""Moving the reactor counters to the metric layer from Amnon
  > Revert "fstream: Auto-close data_sink and data_source"
  > rpc: Avoid resource unit leaks on failure
  > fstream: Auto-close data_sink and data_source
  > http: Move metrics registration to the metrics layer
  > output_stream: add batching to zero copy interface
  > Revert "slab: Move the metrics registration to the metrics layer"
  > slab: Move the metrics registration to the metrics layer
2016-12-25 15:50:09 +02:00
Amnon Heiman
70b2a1bfd4 Set the prometheus prefix to scylla
This patch makes the prometheus prefix configurable and sets the default
value to scylla.

Fixes #1964

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>
2016-12-25 15:21:53 +02:00
Avi Kivity
b99a0fc076 licenses: clarify that licenses in this directory do not cover entire work
2016-12-25 12:59:38 +02:00
Avi Kivity
aa392810ff Merge "Reduce the size of mutation_partition" from Piotr
"Reduce the size of mutation_partition by implementing intrusive set using
bi::rbtree_algorithms directly and using tree nodes optimized for size.

This will reduce the size of mutation_partition by:
24 bytes + <number of cql rows> * 8 bytes

This should have a positive impact on performance because mutation_partitions
are stored both in memtable and cache.

Fixes #742."

* 'haaawk/742' of github.com:cloudius-systems/seastar-dev:
  intrusive_set: rename size() to calculate_size()
  Make intrusive_set_external_comparator::_value_traits static
  Implement intrusive set using rbtree_algorithms
  mutation_partition: make apply_reversibly_intrusive_set nongeneric
  mutation_partition: take schema in find_row and clustered_row
  mutation_partition: Extract intrusive set logic to a class.
  mutation_partition: Replace value_comp with key_comp calls
2016-12-25 12:56:10 +02:00
Benoît Canet
a24ff47c63 scylla_setup: Use blkid or ls to list potentials block devices
blkid does not list root raw device.

Revert to lsblk while taking care of having a fallback
path in case the -p option is not supported.

Fixes #1963.

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20161225100204.13297-1-benoit@scylladb.com>
2016-12-25 12:03:40 +02:00
Takuya ASADA
f3e45bc9ef dist/redhat: don't try to adduser when user is already exists
Currently we get "failed adding user 'scylla'" on .rpm installation when the user already exists; skip adduser in that case to prevent the error.

Fixes #1958

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1482550075-27939-1-git-send-email-syuu@scylladb.com>
2016-12-25 11:37:25 +02:00
Piotr Jastrzebski
345ed5b6ff intrusive_set: rename size() to calculate_size()
This hopefully will make it more apparent that
the time complexity of this method is O(N) not O(1).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:32:13 +01:00
Piotr Jastrzebski
151fa3aaf0 Make intrusive_set_external_comparator::_value_traits static
_value_traits can be shared among all instances
and there's no need to store it in every single one.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:32:13 +01:00
Piotr Jastrzebski
671affc36c Implement intrusive set using rbtree_algorithms
This new implementation takes less memory because it
does not store comparator.

It also uses tree nodes optimized for size. This means
that instead of storing an enum field |color| they embed
this information inside pointer to parent.
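
To illustrate the idea of embedding the color in the parent pointer, here is a
sketch that models pointers as plain integers. It assumes nodes are at least
2-byte aligned, so the lowest address bit is always zero and can carry the
color. The helper names are hypothetical and this is not the actual node
layout.

```python
COLOR_MASK = 0x1  # lowest bit is free when addresses are at least 2-byte aligned
RED, BLACK = 0, 1


def pack(parent_addr: int, color: int) -> int:
    """Store the color in the unused low bit of the parent address."""
    if parent_addr & COLOR_MASK:
        raise ValueError("parent address must be aligned so the low bit is free")
    return parent_addr | color


def parent_of(word: int) -> int:
    """Recover the parent address by clearing the color bit."""
    return word & ~COLOR_MASK


def color_of(word: int) -> int:
    """Recover the color from the low bit."""
    return word & COLOR_MASK


def set_color(word: int, color: int) -> int:
    """Change the color while keeping the parent address."""
    return parent_of(word) | color
```

The saving comes from dropping a separate color field, which with padding
typically costs a full machine word per node.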

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:32:13 +01:00
Piotr Jastrzebski
b0f712a4e8 mutation_partition: make apply_reversibly_intrusive_set nongeneric
apply_reversibly_intrusive_set is used only in one place
and always with rows_type. There's no need for it to be generic.
This will allow changing intrusive set implementation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Piotr Jastrzebski
2af6ff68d9 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Piotr Jastrzebski
b3b924dec9 mutation_partition: Extract intrusive set logic to a class.
It will make it easier to change the implementation
of the intrusive set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Piotr Jastrzebski
ac7481f4b2 mutation_partition: Replace value_comp with key_comp calls
This will reduce the size of bi::set API being used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Tomasz Grabiec
f2a63270d1 sstables: Fix double close on index and data files when writing fails
File output streams take responsibility for closing the file; they
close the file as part of closing the stream.

During sstable writing we create an sstable object and keep file
references there as well. The sstable object also has responsibility for
closing the files, and does so from sstable::~sstable().

Double close was supposed to be avoided by a construct like this:

  writer.close().get();
  _file = {};

However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.

Another problem is that if an exception happened before we reached that
construct, we should still close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed, and once that's fixed a double close on _file will be even more
likely.

The fix employed here is to not keep files inside sstable object when
writing. As soon as the writer is constructed, it's the only owner of
the file.
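
The ownership rule can be sketched as below. The File and Writer classes are
simplified, hypothetical stand-ins: once the writer is constructed it is the
only party that closes the file, so the file is closed exactly once even when
closing the writer fails.

```python
class File:
    """Counts close() calls so a double close is easy to detect."""

    def __init__(self):
        self.close_count = 0

    def close(self):
        self.close_count += 1


class Writer:
    """Sole owner of the file from construction onwards."""

    def __init__(self, file, fail_on_close=False):
        self._file = file
        self._fail_on_close = fail_on_close
        self.written = []

    def write(self, data):
        self.written.append(data)

    def close(self):
        try:
            if self._fail_on_close:
                raise OSError("flush failed")  # simulated failure while closing
        finally:
            self._file.close()  # the only place the file is ever closed


def write_component(file, chunks, fail_on_close=False):
    writer = Writer(file, fail_on_close)  # ownership moves to the writer here
    try:
        for chunk in chunks:
            writer.write(chunk)
    finally:
        writer.close()  # runs on success and on failure alike
```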

Fixes #1764.

Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
2016-12-23 11:44:43 +02:00
Raphael S. Carvalho
fd80499b3d database: make column_family::add_sstable() private again
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <38226308bee2970a91b0e35370d6a646b85ecfe9.1482459877.git.raphaelsc@scylladb.com>
2016-12-23 11:42:16 +02:00
Paweł Dziepak
e6d27ac529 query: introduce result_memory_accounter::foreign_state
Range queries used to be performed sequentially and the shard performing
part of the read was reading state of the merger's memory accounter
directly. Now, they may be performed in parallel so it is safer to just
pass relevant data by value to the interested shards so that they are not
reading something that another shard is modifying at the same time.

Since the query is done in parallel there is a chance of overread. However,
the parallelism is high only in sparsely populated tables, and that's
when the overread is a less serious problem.
2016-12-22 17:16:24 +01:00
Paweł Dziepak
49d675223e storage_proxy: fix short reads in parallel range queries
Since a1cafed370 "storage_proxy: handle
range scans of sparsely populated tables" nonsingular range queries may
be performed in parallel on multiple shards. The consequence of this is
that results may be added to the merger out of order. This requires more
complex logic for handling short reads.

As soon as mutation_result_merger gets a short read it starts to discard
all subsequently received results that are known to contain partitions
with larger keys.
Then, when the final result is being prepared, the merger may need to
combine and sort results whose ordering is not known. If at least one
of these results is a short one, all partitions with larger keys are
removed.

Because the request is performed in parallel, it is possible that even
though there was a short read the merger has got enough live data to
satisfy the specified limits. If this happens, the short read flag is
not set on the final result.
2016-12-22 17:16:24 +01:00
Paweł Dziepak
1a52569f7d storage_proxy: pass maximum result size to replicas
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator and not hardcoded in the
replicas this can be done without causing data query digest mismatches
or wasteful mutation query results.
2016-12-22 17:16:23 +01:00
Paweł Dziepak
40176ca2f8 mutation_partition: use result limiter for digest reads
Even if we are performing a digest query we should do proper result
memory accounting so that the result ends exactly in the same place that
it would if it was a data query. This is to avoid digest mismatches
between replicas.
2016-12-22 17:16:23 +01:00
Avi Kivity
8686a59ea5 dht: use nonwrapping_ranges in ring_position_range_sharder
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges.  Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>
2016-12-22 14:40:30 +01:00
Takuya ASADA
7c3b98806d dist/common/scripts/scylla_setup: improve the message of disk selection prompt
To avoid confusing users, state that we only list unmounted disks.

Fixes #1841

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479720708-6021-1-git-send-email-syuu@scylladb.com>
2016-12-22 15:36:46 +02:00
Paweł Dziepak
a7d694654a query: make result_memory_limiter constants available for linker
2016-12-22 13:35:04 +01:00
Paweł Dziepak
a0523df8d6 result_memory_limiter: add accounter for digest reads
Digest reads differ from data reads in that they do not really
consume any memory. We still want them to stop in the same place that
data reads would, but the per-shard semaphore shouldn't be updated by
them.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
38ee69dee0 idl: allow writers to use any output stream
Original IDL generated code was hardcoded to always use bytes_ostream.
This patch makes the output stream a template parameter so that any
valid output stream can be used.
Unfortunately, making IDL writers generic requires updates in the code
that uses them. C++17 addresses this, since it would be able to deduce the
parameter in most cases.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
aa083d3d85 result_memory_limiter: split new_read() to new_{data, mutation}_read()
For data queries it is very important that all replicas get limited in
the same place (this includes replicas returning only digest). That's
why they shouldn't be affected by per-shard result memory limit.
Moreover, we should make sure that individual memory limits are the
same, so the coordinator provides the limit to replicas, which allows it
to be changed safely in the future.

Mutation queries are not as sensitive but it is still beneficial to make
sure that all replicas use the same individual limit.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
b8e29cc99c idl: is_short_read() was added in 1.6
2016-12-22 13:35:04 +01:00
Paweł Dziepak
1c7cade559 mutation_partition: honour allowed_short_read for static rows
2016-12-22 13:35:04 +01:00
Paweł Dziepak
a7a454c388 storage_proxy: fix _is_short_read computation
2016-12-22 13:35:04 +01:00
Paweł Dziepak
8c1e4a707c storage_proxy: disallow short reads if got no live rows
If, after reconciliation, the coordinator ends up with no live rows and
short reads are allowed, a retry may not make any progress if replicas
end their reads in the same place. The solution is to disallow short
reads on retries that are caused by the final result having no live rows.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
6db262446f storage_proxy: don't stop after result with no live rows
mutation_result_merger merges results from different shards and stops as
soon as a shard returned a short read or memory usage on the merging
shard is too high. However, it should never stop unless at least one
live row is in the merged result.
2016-12-22 13:35:04 +01:00
Avi Kivity
74ecd7072a Merge "Reduce overhead of get_max_purgeable_timestamp() during compaction" from Tomasz
* 'tgrabiec/calculate-hash-once-compaction' of github.com:cloudius-systems/seastar-dev:
  sstables: Calculate key hash only once during compaction
  tests: sstables: Add more test cases to tombstone_purge_test
  db: Expose column_family::add_sstable
  tests: sstables: Ensure timestamps are increasing
  tests: sstables: Simplify tombstone_purge_test
2016-12-22 14:33:30 +02:00
Tomasz Grabiec
045b9fd7c1 sstables: Calculate key hash only once during compaction
Improves compaction performance.
2016-12-22 13:24:46 +01:00
Tomasz Grabiec
fb8765bef9 tests: sstables: Add more test cases to tombstone_purge_test
2016-12-22 13:24:46 +01:00
Tomasz Grabiec
c7ff2a2bb0 db: Expose column_family::add_sstable
Needed by compaction tests.
2016-12-22 13:24:46 +01:00
Tomasz Grabiec
d841cab02c tests: sstables: Ensure timestamps are increasing
2016-12-22 13:24:45 +01:00
Tomasz Grabiec
21ade8e4a4 tests: sstables: Simplify tombstone_purge_test
- moved to seastar thread
- extracted sstable creation and validation logic
- reduced code duplication
- switched to mutation_reader assertions
- used result of compact_sstable() to locate the new sstable
- rather than setting gc timestamp in the past, bump the clock
  before compacting
2016-12-22 13:24:41 +01:00
Tomasz Grabiec
bc6486b304 Use gc_clock instead of db_clock where possible
Some code paths were obtaining a db_clock timestamp only to convert it
to gc_clock later. Avoid this. In the future we could make gc_clock
cheaper because it has low precision.

Message-Id: <1482401190-2035-1-git-send-email-tgrabiec@scylladb.com>
2016-12-22 13:27:55 +02:00
Raphael S. Carvalho
c26090a6b2 sstables/compress: fix error message for snappy uncompression
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <898ad07db705355bdbf780afdb3aa982b8ca3823.1482364125.git.raphaelsc@scylladb.com>
2016-12-22 09:08:34 +01:00
Raphael S. Carvalho
27fb8ec512 db: avoid excessive disk usage during sstable resharding
Shared sstables will now be resharded in the same order to guarantee
that all shards owning an sstable will agree on its deletion at nearly
the same time, thereby reducing the disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.
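
A sketch of the ordering rule, assuming each shared sstable is described by a
hypothetical (column family UUID, generation) pair. Python's own UUID ordering
is used here only for illustration; any total order that every shard applies
identically would serve the same purpose.

```python
import uuid


def resharding_order(shared_sstables):
    """Order shared sstables by column family UUID, then by generation.

    Every shard sorts its own view with the same key, so all shards work
    through the shared sstables in the same sequence.
    """
    return sorted(shared_sstables, key=lambda s: (s[0], s[1]))
```

Because the order depends only on the sstables themselves, shards that see
them in different arrival orders still process them in the same sequence.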

Fixes #1952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
2016-12-21 23:18:06 +02:00
Tomasz Grabiec
d87d50dc64 db: Use microsecond precision for server-side timestamps
Currently server-side timestamps use a clock with millisecond
precision. Timestamps have microsecond resolution, with lower bits
used to serialize mutations originating from given client.

Timestamps for column drops always use just the millisecond base. A
column drop which is executed after an insert may thus be given a lower
timestamp than the insert, even when the two are serialized on the
client side over the same connection.

Use microsecond precision to reduce chances of that event.
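
The effect can be shown with a simplified model. This is an assumption-laden
sketch, not the server's actual timestamp code: a timestamp is the clock
reading truncated to the clock's precision, an insert adds a small sequence
number in the low bits, and a column drop uses the truncated value alone.

```python
MS = 1000  # microseconds per millisecond
US = 1     # microsecond precision


def truncate(now_us: int, precision_us: int) -> int:
    """Round a microsecond clock reading down to the given precision."""
    return now_us - now_us % precision_us


def insert_timestamp(now_us: int, precision_us: int, sequence: int) -> int:
    """Truncated clock value plus a per-client sequence in the low bits."""
    return truncate(now_us, precision_us) + sequence


def drop_timestamp(now_us: int, precision_us: int) -> int:
    """Column drops use only the truncated clock value."""
    return truncate(now_us, precision_us)


# An insert followed 650 microseconds later by a column drop.
t_insert = 1_482_343_119_123_250
t_drop = t_insert + 650

# Millisecond precision: both land in the same millisecond, so the later
# drop receives the lower timestamp.
drop_before_insert_ms = drop_timestamp(t_drop, MS) < insert_timestamp(t_insert, MS, 1)

# Microsecond precision: the drop is ordered after the insert.
drop_after_insert_us = drop_timestamp(t_drop, US) > insert_timestamp(t_insert, US, 1)
```

Microsecond precision only reduces the chance of the inversion; two operations
within the same microsecond could still collide in this model.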

This is supposed to fix sporadic failures of
schema_test.py:TestSchema.drop_column_queries_test dtest.
Message-Id: <1482343119-27698-1-git-send-email-tgrabiec@scylladb.com>
2016-12-21 18:03:22 +00:00
Avi Kivity
875635554d Merge "Reduce overhead of partition presence checker during cache update" from Tomasz
Refs #1943.

* 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev:
  db: Compute key hash once in partition_presence_checker
  bloom_filter: Allow checking presence using pre-hashed key
  db: Use incremental selector in partition_presence_checker
2016-12-21 14:24:54 +02:00
Takuya ASADA
d356c21512 configure.py: don't allow to run multiple 'ninja -C seastar' on same time
Scylla's build.ninja allows multiple 'ninja -C seastar' invocations to run at the same time,
which breaks the DPDK build after the upgrade to DPDK-16.10:
https://gist.github.com/syuu1228/4bd1170630b7e5f15653281b4728e521

To prevent this, we need to limit the number of concurrent seastar builds to one.

Note: this does not disable parallel builds within Seastar.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1482250560-20289-1-git-send-email-syuu@scylladb.com>
2016-12-21 12:42:52 +02:00
Vlad Zolotarov
62cad0f5f5 tracing: don't start tracing until a Tracing service is fully initialized
RPC messaging service is initialized before the Tracing service, so
we should prevent creation of tracing spans before the service is
fully initialized.

We will use an already existing "_down" state and extend it in a way
that !_down equals "started", where "started" is TRUE when the local
service is fully initialized.

We will also split the Tracing service initialization into two parts:
   1) Initialize the sharded object.
   2) Start the tracing service:
      - Create the I/O backend service.
      - Enable tracing.
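
The two-phase start can be sketched like this. The Tracing class below is a
simplified, hypothetical stand-in: constructing the object corresponds to
step 1, start() corresponds to step 2, and no span is created while the
service is still "down".

```python
class Tracing:
    def __init__(self):
        # Phase 1: the object exists, but the service is not started yet.
        self._down = True
        self._backend = None

    def start(self, backend):
        # Phase 2: attach the I/O backend, then enable tracing.
        self._backend = backend
        self._down = False

    def started(self) -> bool:
        return not self._down

    def create_span(self, name):
        """Return a span, or None if the service is not fully initialized."""
        if self._down:
            return None
        return {"name": name, "backend": self._backend}
```

Callers that run before start(), such as an RPC layer that comes up earlier,
simply get no span instead of touching a half-initialized service.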

Fixes issue #1939

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>
2016-12-21 12:40:14 +02:00
Gleb Natapov
0a2dd39c75 messaging_service: move MUTATION_DONE messages to separate connection
If a node gets more MUTATION requests than it can handle via RPC, it will
stop reading from this RPC connection, but this will prevent it from
getting MUTATION_DONE responses for requests it coordinates, because
currently MUTATION and MUTATION_DONE messages share the same connection.

To solve this problem, this patch moves MUTATION_DONE messages to a
separate connection.

Fixes: #1843

Message-Id: <20161201155942.GC11581@scylladb.com>
2016-12-21 11:10:15 +02:00
Piotr Jastrzebski
3e502de153 mutation_partition: don't use unique_ptr to manage LSA objects
Unique_ptr won't destruct them correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>
2016-12-21 09:40:15 +01:00
Raphael S. Carvalho
e28537b56f sstables: fix calculation of memory footprint for summary
The size of keys wasn't taken into account, so the value reported
via collectd is much smaller than the actual footprint.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>
2016-12-20 18:28:47 +00:00
Paweł Dziepak
d0e61fd092 test.py: remove '.cc' from view_schema_test
2016-12-20 18:26:52 +00:00
Avi Kivity
3989e4ed15 Revert "config, dht: reduce default msb ignore bits to 4"
This reverts commit b81a57e8eb.

With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.
2016-12-20 19:41:05 +02:00
Duarte Nunes
a9e5b7f124 view_info: Fix comparison
Two view_info objects are equal if their fields are equal, not
different.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482253839-2736-1-git-send-email-duarte@scylladb.com>
2016-12-20 18:36:39 +01:00
Avi Kivity
a1cafed370 storage_proxy: handle range scans of sparsely populated tables
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard.  With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.

Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).

If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatenation.  This
optimizes for the dense(r) case, where few partial results are required.

If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys.  This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.

We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.
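
The doubling strategy can be sketched as follows. This is a simplified model
under stated assumptions: `query` is a hypothetical callable that returns the
rows for a batch of subranges in order, and the per-shard collapsing and
sort-based merging described above are left out.

```python
def exponential_scan(subranges, query, limit):
    """Scan subranges in batches whose size doubles after every round.

    Returns (rows, rounds). `query(batch)` is assumed to return the rows
    found in the given subranges, already in the right order.
    """
    rows = []
    position = 0
    batch_size = 1
    rounds = 0
    while position < len(subranges) and len(rows) < limit:
        batch = subranges[position:position + batch_size]
        rows.extend(query(batch))
        position += len(batch)
        batch_size *= 2  # query twice as many subranges next time
        rounds += 1
    return rows[:limit], rounds
```

With 1024 empty subranges this model needs 11 rounds (batches of 1, 2, 4, and
so on) instead of 1024 sequential queries, while a densely populated table
still stops after the first round or two.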

[tgrabiec: trivial conflicts]

Message-Id: <20161220170532.25173-1-avi@scylladb.com>
2016-12-20 18:32:29 +01:00
Tomasz Grabiec
dc94bd0642 Merge branch 'materialized-views/cql/v4' from git@github.com:duarten/scylla.git
This patchset implements the multiple CQL3 statements relating to
materialized views, as well as ensuring other statements now take
materialized views into account. It also adds the necessary internal
data structures to hold materialized view metadata.
2016-12-20 14:21:18 +01:00
Duarte Nunes
8ac4d7b2e8 tests: Add view_schema_test
This patch adds a set of tests for materialized view schema
handling, complementing the dtests for the same feature.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
eb25a8f3cd cql_test_env: Add do_with_cql_env_thread function
This patch introduces the do_with_cql_env_thread() function, which
behaves like do_with_cql_env() except that it executes the
user-specified function in the context of a Seastar thread.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
124802e196 cql3: Add function to build view's select statement
This patch adds a utility function that creates a raw select
statement from a set of columns and a where clause. It is intended to
be used to create the prepared select statement used by the view
class.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
088dfdb108 select_statement: Consider materialized views
This patch considers materialized views in
select_statement::check_access().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5511dab914 cql3: Add drop view statement
This patch adds the drop_view_statement, which enables users to drop a
given materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5c51a24217 cql3: Parse drop view statement
This patch adds the necessary grammar to Cql.g to parse drop view
statements.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
3025ea63fc cql3: Add alter view statement
This patch adds the alter_view_statement, which enables users to
change the properties of a materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
71b1e7c056 cql3: Parse alter view statement
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
8792fed651 create_view_statement: Complete implementation
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
02bc0d2ab3 create_view_statement: Require MV feature
This patch adds the MATERIALIZED_VIEWS_FEATURE to the set of cluster
features and requires its presence to allow creating a view. This
ensures view schemas can be safely propagated across nodes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
59682c95a1 create_view_statement: Require experimental switch
Creating a materialized view requires running Scylla with the
experimental switch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
c626c983f4 create_view_statement: Reuse validation code
This replaces some validation logic with a call to
validation::validate_column_family.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5bd74abee8 create_view_statement: Implement check_access
This patch implements check_access according to Cassandra's
implementation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
a9c17b0a52 select_statement: Propagate for_view argument
This patch propagates the for_view argument, used by
statement_restrictions to ensure IS NOT NULL can be used when creating
a materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
65535b3444 modification_statement: Check access for tables with views
This patch checks for additional permissions when modifying a table
with views, since that update will require reading from the table and
writing into its views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5187fdbb3a modification_statement: Views aren't updated directly
This patch ensures that views cannot be modified directly through an
insert or update statement.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
21e34c5054 alter_type_statement: Consider materialized views
This patch ensures we also update materialized views where the type
being updated occurs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
a5b7b0464b migration_manager: Only drop table without views
This patch forbids dropping a column family if there are still views
associated with it, and also forbids dropping a view through the drop
table statement.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
76276f1a53 alter_table_statement: Update materialized view
This patch ensures that changes to a base table's schema
are reflected in that table's materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
44a1f2d836 query_processor: Use cql3::util::do_with_parser()
To minimize code duplication, have query_processor use
do_with_parser() instead of manually creating the CqlParser.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
bd1e66f411 cql3: Allow renaming a column in a where clause
This patch adds a utility function to rename a column occurring in a
textual where clause. It is intended to change a view's where clause
when users alter the underlying base table.

To do this, we rely on functions that transform a textual where clause
into a set of relations, which allows us to reliably rename the column.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
ced4b6e4ff cql3: Allow renaming an identifier in a relation
This patch adds a utility function to rename an identifier
occurring in a cql3 relation. This function will be used when renaming
an identifier in a view's where clause.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
282c023524 migration_manager: Announce view drop
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
99aa8eb4b8 migration_manager: Announce view update
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
6ef3358321 migration_manager: Announce new view creation
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
8ce21a9c01 schema_tables: Make drop view mutations
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
61a5a74ea2 schema_tables: Make update view mutations
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
2098c336d9 schema_tables: Make create view mutations
This patch builds the mutations to announce a new view. Aside from
including the view schema, we include the base table mutations so
that a node is resilient against receiving create view mutations
before the base table create mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
19a76a82e8 frozen_schema: Support view schemas
This patch allows a view schema to be frozen. To unfreeze such a
schema, we add an is_view attribute to the schema idl.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
c11eb30225 schema_tables: Replace add_table_to_schema_mutation
This patch replaces the add_table_to_schema_mutation() function with
add_table_or_view_to_schema_mutation().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
04b93ba803 schema_tables: Make view mutations
This patch adds functions that translate a view schema to the
corresponding mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
fe632e8ba5 schema_tables: Factor out duplicate code
This patch factors out duplicate code between
merge_tables() and merge_views().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
3fd79bb6d6 schema_tables: Merge views for schema merging
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
06ab61a570 schema_tables: Extract update_column_family
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
ecc4290bc6 database: Remove view from base table upon drop
This patch changes the drop_column_family() function to remove
a view schema from the list of views of its base table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
4f166cfa6a database: Parse views schema table upon init
This patch adds code for parsing the views schema table upon init and
also ensures that, when adding a view column family, we add it to its
base table's list of views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
40c684b5f5 database: Extract common create cf code
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
42242273f6 schema_tables: Create views from mutations
This patch enables views to be created from their low-level,
mutation-based representation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
888a8923c7 read_table_mutations: Support other schemas
This patch changes read_table_mutations() so that it can now
read schemas from other tables besides the column families
schema table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
93458f314c migration_manager: Notify of view schema changes
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
22d8aa9bb6 migration_listener: Listen for view schema changes
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
b9cf25c4dd schema_tables: Add views schema table
This patch adds the views schema table, containing the definition of
views in a keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
e41494996f thrift: Skip materialized views
This patch ensures we don't provide access to materialized views over
thrift. This includes preventing updates but also omitting them when
describing a keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
2b231f22b8 keyspace_metadata: Add tables() and views() functions
This patch adds utility functions to keyspace_metadata to select only
the tables or only the views out of all the schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
7818339791 materialized views: Add view class
This patch adds the view class, which will contain functions related
to populating a view, either from the base table's write path or from
the view building mechanism which copies over already existing data in
the base table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
d0ed8fa29b schema: Add view_ptr class
The view_ptr class contains a schema_ptr known to represent a
materialized view. It is intended to be used by functions that require
such a schema, and thus obviates the need for the function to check for
schema::is_view().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
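
A minimal sketch of the idea behind view_ptr, under simplifying
assumptions: the `schema` struct below is a hypothetical stand-in rather
than the project's real schema class, and the constructor check and the
`describe_view()` helper are illustrative only. It shows how verifying
is_view() once, at construction, lets callers skip the check.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a schema; the real class is far richer.
struct schema {
    std::string name;
    bool view = false;
    bool is_view() const { return view; }
};
using schema_ptr = std::shared_ptr<const schema>;

// Wraps a schema_ptr that was verified once to describe a view.
class view_ptr {
    schema_ptr _schema;
public:
    explicit view_ptr(schema_ptr s) : _schema(std::move(s)) {
        if (!_schema || !_schema->is_view()) {
            throw std::invalid_argument("schema is not a materialized view");
        }
    }
    const schema* operator->() const { return _schema.get(); }
    const schema_ptr& get() const { return _schema; }
};

// A function taking view_ptr needs no is_view() check of its own.
inline std::string describe_view(const view_ptr& v) {
    return "view:" + v->name;
}
```

Encoding the invariant in the type moves the failure to the point where
the wrapper is built, instead of scattering checks across every consumer.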
Duarte Nunes
82ce8eedbd schema: Add view_info field
This patch adds a view_info optional field to the schema. Its
presence indicates the schema represents a materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
4b3ac42914 materialized views: Add view_info class
The view_info class is meant to augment a schema with
fields relevant for materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
d7e607ff51 query_pagers: Fix over-counting of rows
This patch fixes a regression introduced in 0518895, where we counted
one extra row per partition when it contained live, non-static rows.

We also simplify the visitor logic further, since now we don't need to
count rows one by one. Also remove a bunch of unused fields.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482234083-2447-1-git-send-email-duarte@scylladb.com>
2016-12-20 11:58:37 +00:00
Tomasz Grabiec
0e487b3499 db: Compute key hash once in partition_presence_checker
I measured reduction of cache update time by 20% for 6 sstables and by
40% for 16.

Refs #1943.
2016-12-19 14:20:58 +01:00
Tomasz Grabiec
ab5c77fcf1 bloom_filter: Allow checking presence using pre-hashed key
Will allow us to calculate the hash once and use it on many filters
instead of calculating the hash for each filter separately.

Another change made is to avoid precomputing all indexes during filter
operations, and have a for_each_index() template instead, which invokes a
functor.
2016-12-19 14:20:58 +01:00
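
The following is a rough sketch of hashing a key once and reusing the
result across several filters, with a for_each_index() template that
invokes a functor instead of precomputing indexes. Everything here is an
assumption made for illustration: `hashed_key`, `make_hashed_key()`, the
double-hashing scheme and the class layout are hypothetical and do not
reflect the actual bloom_filter code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical pair of hashes computed once per key.
struct hashed_key {
    uint64_t h1;
    uint64_t h2;
};

inline hashed_key make_hashed_key(const std::string& key) {
    uint64_t h1 = std::hash<std::string>{}(key);
    uint64_t h2 = (h1 * 0x9E3779B97F4A7C15ULL) | 1; // odd step for double hashing
    return {h1, h2};
}

class bloom_filter {
    std::vector<bool> _bits;
    unsigned _hash_count;

    // Invokes func on each bit index; stops early if func returns false.
    template <typename Func>
    void for_each_index(const hashed_key& k, Func&& func) const {
        uint64_t h = k.h1;
        for (unsigned i = 0; i < _hash_count; ++i, h += k.h2) {
            if (!func(static_cast<size_t>(h % _bits.size()))) {
                return;
            }
        }
    }
public:
    bloom_filter(size_t bits, unsigned hash_count)
        : _bits(bits), _hash_count(hash_count) {}

    void add(const hashed_key& k) {
        for_each_index(k, [this](size_t i) { _bits[i] = true; return true; });
    }
    bool is_present(const hashed_key& k) const {
        bool present = true;
        for_each_index(k, [&](size_t i) { present = _bits[i]; return present; });
        return present;
    }
};
```

With this shape, a caller computes `make_hashed_key()` once and passes the
same value to is_present() on each filter, and a lookup stops at the first
unset bit.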
Tomasz Grabiec
78844fa2e5 db: Use incremental selector in partition_presence_checker
This reduces the number of sstables we need to check to only those
whose token range overlaps with the key. Reduces cache update
time. Especially effective with leveled compaction strategy.

Refs #1943.

Incremental selector works with an immutable sstable set, so cache
updates need to be serialized. Otherwise we could mispopulate due to
stale presence information.

Presence checker interface was changed to accept decorated key in
order to gain easy access to the token, which is required by
the incremental selector.
2016-12-19 14:20:58 +01:00
Avi Kivity
b740aff777 tests: adjust mutation_query_test for partition and row limits
Won't build otherwise.
2016-12-19 11:37:25 +02:00
Avi Kivity
f3c8cbbac5 Merge "Introduce dht::token_range and dht::partition_range" from Asias
"nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.

Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide."

* tag 'asias/repair_dht_token_range/v2' of github.com:cloudius-systems/seastar-dev:
  Convert to use dht::partition_range_vector and dht::token_range_vector
  dht: Introduce dht::partition_range_vector and dht::token_range_vector
  Get rid of query::partition_range
  Convert to use dht::partition_range
  Convert to use dht::token_range
  dht: Rename token_range to token_range_endpoints
  dht: Introduce dht::token_range and dht::partition_range
2016-12-19 10:59:52 +02:00
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
7a446986fa dht: Introduce dht::partition_range_vector and dht::token_range_vector
std::vector<dht::partition_range> and std::vector<dht::token_range> are
used in a lot of places, introduce dht::partition_range_vector and
dht::token_range_vector as the alias.
2016-12-19 08:09:28 +08:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Asias He
85034c1b57 Convert to use dht::partition_range 2016-12-19 08:04:30 +08:00
Asias He
d1178fa299 Convert to use dht::token_range 2016-12-19 08:04:29 +08:00
Asias He
1f06eedb58 dht: Rename token_range to token_range_endpoints
It is a helper class used in storage_service only. Rename it so we can
use it for the real dht::token_range.
2016-12-19 08:04:29 +08:00
Asias He
264b6ee69e dht: Introduce dht::token_range and dht::partition_range
nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.

Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide.
2016-12-19 08:04:29 +08:00
Avi Kivity
32fb4c3661 Merge "repair: Reduce unnecessary streaming traffic even more" from Asias
"In 7c873f0d (repair: Reduce unnecessary streaming traffic), we optimize
in cases when 1) all the remote nodes have the same checksum and 2) the local
node has zero checksum.

In this series, we make the optimization more generic and cover more cases."

* tag 'asias/repair/node_reducer/v3' of github.com:cloudius-systems/seastar-dev:
  repair: Reduce unnecessary streaming traffic even more
  repair: Add hash specialization for partition_checksum
2016-12-18 16:53:39 +02:00
Avi Kivity
3421ebe8be Merge "storage_proxy: Enforce row limit" from Duarte
"This patchset ensures the row limit is enforced at
the storage_proxy level. Upper layers like the pager may
already be depending on this behavior."

* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
  query_pagers: Don't trim returned rows
  select_statement: Don't always trim result set
  query_result_merger: Limit rows
  mutation_query: to_data_query_result enforces row limit
2016-12-18 08:15:51 +02:00
Avi Kivity
6bb875bdb7 Merge "storage_proxy: Enforce partition limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."

* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
  storage_proxy: Decrease limits when retrying command
  storage_proxy: Don't fetch superfluous partitions
  query::result: Add partition count
  column_family: Use counters in query::result::builder
  query_result_builder: Use the underlying counters
  mutation_partition: Count partitions in query_compacted
  mutation_partition: Remove tabs in query_compacted
  query::result::builder: Add partition count
  query_result_merger: Limit partitions
2016-12-16 13:57:37 +02:00
Glauber Costa
7133583797 track streaming and system virtual dirty memory
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.

After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
2016-12-16 10:59:40 +02:00
Avi Kivity
293876c72f Merge "Limit number of readers streaming uses" from Paweł
"Original, naive db::make_streaming_reader() implementation created a set
of memtable and sstable readers for every partition range. This caused
bad interaction with the code limiting sstable readers concurrency and
was suboptimal.

This series introduces multi range mutation reader that takes mutation
source and a sorted, disjoint vector of ranges. It creates only a single
set of memtable and sstable readers and fast forwards it to the next
range once the current one is completed."

* 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev:
  db: use multi range reader for streaming readers
  dht: describe split_range[s]_to_shards() guarantees
  repair: remove outdated fixme
  test/mutation_reader_test: add multi_range_reader test
  tests/mutation_reader: extract key creation code
  mutation_reader: add multi_range_reader
2016-12-15 17:48:31 +02:00
Paweł Dziepak
cf679a413c db: use multi range reader for streaming readers
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.

The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
2016-12-15 13:54:43 +00:00
Paweł Dziepak
b86a826baf dht: describe split_range[s]_to_shards() guarantees
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
2016-12-15 13:07:32 +00:00
Paweł Dziepak
5287417136 repair: remove outdated fixme 2016-12-15 13:07:32 +00:00
Paweł Dziepak
5b0cf20f75 test/mutation_reader_test: add multi_range_reader test 2016-12-15 13:07:32 +00:00
Paweł Dziepak
787a976c2b tests/mutation_reader: extract key creation code 2016-12-15 13:07:32 +00:00
Paweł Dziepak
52a4e79210 mutation_reader: add multi_range_reader
So far, the only way to combine outputs of multiple readers was to use
combining reader. It is very general and, in particular, supports case
when the readers emit mutations from overlapping ranges.

However, we have cases (e.g. streaming) when we need to read from
several disjoint ranges. Combining reader is a suboptimal solution as it
requires creating a reader for each range and ignores the fact that
they do not overlap.

This patch introduces multi_range_mutation_reader which takes a
mutation_source and a sorted set of disjoint ranges. Internally, it uses
mutation_reader::fast_forward_to() to move to the next range once the
current one is completed.
2016-12-15 13:07:31 +00:00
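
The approach can be sketched as follows, assuming C++17 and heavily
simplified types: integer keys stand in for partitions, `key_range` and
`simple_reader` are hypothetical stand-ins for the real range and reader
types, and everything is synchronous. Only the control flow (one
underlying reader, fast forwarded from range to range) mirrors the
description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical half-open key range [start, end).
struct key_range {
    int start;
    int end;
};

// Stand-in for a mutation reader over sorted keys; it only moves forward.
class simple_reader {
    const std::vector<int>& _keys;
    size_t _pos = 0;
    int _end = 0;
public:
    explicit simple_reader(const std::vector<int>& sorted_keys) : _keys(sorted_keys) {}

    // Skips ahead to the given range, which must not precede the current one.
    void fast_forward_to(const key_range& r) {
        _pos = std::lower_bound(_keys.begin() + _pos, _keys.end(), r.start) - _keys.begin();
        _end = r.end;
    }
    std::optional<int> next() {
        if (_pos < _keys.size() && _keys[_pos] < _end) {
            return _keys[_pos++];
        }
        return std::nullopt;
    }
};

// Reads sorted, disjoint ranges using a single underlying reader.
class multi_range_reader {
    simple_reader _reader;
    std::vector<key_range> _ranges;
    size_t _current = 0;
public:
    multi_range_reader(const std::vector<int>& sorted_keys, std::vector<key_range> ranges)
        : _reader(sorted_keys), _ranges(std::move(ranges)) {
        if (!_ranges.empty()) {
            _reader.fast_forward_to(_ranges[0]);
        }
    }
    std::optional<int> next() {
        while (_current < _ranges.size()) {
            if (auto k = _reader.next()) {
                return k;
            }
            // Current range is exhausted: fast forward to the next one.
            if (++_current < _ranges.size()) {
                _reader.fast_forward_to(_ranges[_current]);
            }
        }
        return std::nullopt;
    }
};

inline std::vector<int> read_all(multi_range_reader& r) {
    std::vector<int> out;
    while (auto k = r.next()) {
        out.push_back(*k);
    }
    return out;
}
```

Because the ranges are sorted and disjoint, the reader never has to move
backwards, which is what makes a single forward-only cursor sufficient.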
Duarte Nunes
0518895f5b query_pagers: Don't trim returned rows
Since storage_proxy::query() now respects the read_command limits, we
can remove the trimming logic from query_pagers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 11:00:46 +00:00
Duarte Nunes
7ce859799b select_statement: Don't always trim result set
Trimming the result set is only needed when the query contains an "IN"
relation, an ORDER BY clause, and defines a limit, which is the case
where we query different ranges concurrently. We don't use the
result_merger to trim since we first need to reorder the rows.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 11:00:46 +00:00
Duarte Nunes
fee0b7fa48 query_result_merger: Limit rows
This patch makes the row limit enforced by the storage_proxy layer.
It adds a row limit to the query_result_merger, useful when merging
results for concurrent queries.

More importantly, it provides guarantees that upper layers may be
relying on implicitly (e.g., the paging code).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 11:00:36 +00:00
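
As an illustration only, merging with limits could look like the sketch
below. It assumes a hypothetical `partial_result` that records just the
number of rows in each partition, and a `result_merger` class that is not
the real query_result_merger; it also applies a partition limit, since the
merger enforces both.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical partial result: row counts per partition, in order.
using partial_result = std::vector<uint32_t>;

// Merges partial results in order, trimming to the row and partition limits.
class result_merger {
    uint32_t _row_limit;
    uint32_t _partition_limit;
    std::vector<partial_result> _partials;
public:
    result_merger(uint32_t row_limit, uint32_t partition_limit)
        : _row_limit(row_limit), _partition_limit(partition_limit) {}

    void add(partial_result r) { _partials.push_back(std::move(r)); }

    partial_result get() const {
        partial_result merged;
        uint32_t rows_left = _row_limit;
        for (const auto& partial : _partials) {
            for (uint32_t rows : partial) {
                if (rows_left == 0 || merged.size() == _partition_limit) {
                    return merged;
                }
                // Keep only as many rows of this partition as still fit.
                uint32_t taken = std::min(rows, rows_left);
                merged.push_back(taken);
                rows_left -= taken;
            }
        }
        return merged;
    }
};
```

Trimming at merge time means callers that query several ranges
concurrently receive at most the requested number of rows without doing
their own bookkeeping.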
Duarte Nunes
efc986d548 mutation_query: to_data_query_result enforces row limit
This patch changes mutation_query::to_data_query_result() so that it
enforces the row limit alongside the partition limit and the
per-partition limit.

In the following patch, we'll enforce the row limit in an upper layer,
but this lets us optimize the case where only one replica replies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:56:40 +00:00
Duarte Nunes
c2072c7dc9 storage_proxy: Decrease limits when retrying command
This patch changes a read_command's limits when retrying it, so that
we don't ask for more rows than necessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:41:06 +00:00
Duarte Nunes
9572c19dc6 storage_proxy: Don't fetch superfluous partitions
This patch ensures we keep track of how many partitions we've queried
so we don't ask for more than the number we need.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
93be8d7cef query::result: Add partition count
This patch adds a partition count to query::result, filled by the
query::result::builder. The partition count is present whenever the
result carries data, being absent only for the case where the result
contains only a digest.

We also ensure that counts are present for an empty query::result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
781cd82cb8 column_family: Use counters in query::result::builder
This patch changes column_family::query() to use the counters in the
builder to determine how many partitions and rows to ask for and also
to implement the stop condition. This saves a continuation to do the
bookkeeping, and allows us to remove data_query_result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
05b2ef4fa2 query_result_builder: Use the underlying counters
This patch changes the query_result_builder to use the counters
provided by the query::result::builder. It also ensures they are kept
current.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
f5cf7f7921 mutation_partition: Count partitions in query_compacted
This patch changes mutation_partition::query_compacted() to count the
number of partitions written to the underlying writer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
f21dfb8217 mutation_partition: Remove tabs in query_compacted
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
2409b6b250 query::result::builder: Add partition count
This patch adds a partition count to the query::result::builder. It is
intended to be incremented by users, and later used to build a
query::result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
108011a839 query_result_merger: Limit partitions
This patch adds a partition limit to the query_result_merger, useful
when merging results for concurrent queries. This change also makes
the partition limit enforced by the storage_proxy layer, no changes
being needed by the upper layers, namely the Thrift interface.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:41 +00:00
Pekka Enberg
06c5216c9d Merge "Improve gossip feature logging" from Asias 2016-12-15 10:36:54 +02:00
Asias He
e578e65103 gossip: Log feature enabled message on shard zero only
Features are per node. No need to log them once per shard.
2016-12-15 16:33:11 +08:00
Asias He
4137fab91b gossip: Make log in check_features debug level
We saw the message twice for the same feature check. This is a bit
confusing.

INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}

This is because

   ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE);
   ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE);

The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE)
is constructed. The second message is printed when the
ss._range_tombstones_feature is copy-constructed.
2016-12-15 16:33:10 +08:00
Asias He
2b1ebc4719 gossip: Introduce gms:features::enable helper
Add a helper function to enable a feature and log that the feature is
enabled.

When a feature is enabled, we see

INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled
INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled

in the log.
2016-12-15 16:33:10 +08:00
Paweł Dziepak
b70e5d2089 Merge seastar upstream
Submodule seastar 6fbd792..0b98024:
  > fstream: fix read ahead byte metric types
  > fstream: add read-ahead metrics
  > future-util: make stop_iteration use bool_class<>
  > util: introduce bool_class<Tag>
2016-12-14 15:01:13 +00:00
Avi Kivity
57f4910832 Merge "Query result size limiting" from Paweł
"This series makes Scylla limit size of query results it produces in case they
grow unreasonably large. This is possible because CQL paging queries do not
guarantee that the returned page is going to have page_size rows and pages
smaller than that *do not* indicate end of stream. Non-paged queries and Thrift
requests do not have such flexibility and they also get all the requested data
(though their memory usage is still accounted for and may limit paged queries).

There is a maximum result size (1 MB) and all results builders will stop after
reaching it. Moreover, there is a per-shard limitation on the amount of memory
used by all results combined (10%). To avoid tiny results a query has to
reserve (wait if necessary) 4 kB before starting executing, after that it can
consume more memory without any additional waiting provided it is below
individual and shard-local limits.

Enabling the cluster to return fewer rows than requested also means some changes
for the coordinator. Firstly, if it receives such a short result from a replica,
retrying it with a larger limit obviously makes no sense. Instead,
in such cases the coordinator removes the clustering rows it has incomplete
information about and sends a short result back to the client. Moreover, even
if no replica returned a short response, reconciliation may have made it so. In
this case, the coordinator does not necessarily need to retry the query either.
Unfortunately, with the current implementation short responses ruin data
queries since they will cause a digest mismatch.

Three new metrics were added:
 * database_bytes_total_result_memory -- total memory used by query results
 * database_total_operations_short_data_queries -- data queries that were
   limited by size, particularly bad as it basically forces coordinator to
   retry them as mutation queries
 * database_total_operations_short_mutation_queries -- mutation queries limited
   by size"

* 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: clean up after primary_key introduction
  cql3: allow short reads with paged queries
  storage_proxy: handle intentional short reads
  storage_proxy: make sure coordinator has complete data
  storage_proxy: honour partition limit
  storage_proxy: use cmd limits to determine that replica reached end
  db: add metrics for short reads and memory used for results
  data_query: limit result size
  mutation_query: limit result size
  db: create result_memory_accounters when starting query
  query_builder: add partition_slice getter
  reconcilable_result: keep result_memory_tracker object
  mutation_compactor: honour stop_iteration from consumers
  db: add result_memory_limiter
  query: add result size limiter
  reconcilable_result: properly propagate short_read flag
  query_pagers: handle short reads properly
  query: allow short reads
  serializer_impl: add serializer for bool_class<Tag>
2016-12-14 16:53:07 +02:00
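
A simplified, synchronous sketch of the accounting described in this
series, under loud assumptions: the class names echo the patch titles, but
the bodies are hypothetical, waiting for memory is replaced by a
try_reserve() that simply fails, and the constants are taken from the
description above (4 kB minimum, 1 MB maximum, 10% of shard memory).

```cpp
#include <cassert>
#include <cstddef>

// Values taken from the series description.
constexpr size_t minimum_result_size = 4 * 1024;
constexpr size_t maximum_result_size = 1024 * 1024;

// Hypothetical shard-wide budget; a real one would make queries wait.
class result_memory_limiter {
    size_t _available;
public:
    explicit result_memory_limiter(size_t shard_memory) : _available(shard_memory / 10) {}
    bool try_reserve(size_t bytes) {
        if (bytes > _available) {
            return false;
        }
        _available -= bytes;
        return true;
    }
    void release(size_t bytes) { _available += bytes; }
    size_t available() const { return _available; }
};

// Tracks one result; update() returns true when the builder should stop.
class result_memory_accounter {
    result_memory_limiter& _limiter;
    size_t _reserved = minimum_result_size;
    size_t _used = 0;
public:
    // The caller must already have reserved minimum_result_size.
    explicit result_memory_accounter(result_memory_limiter& l) : _limiter(l) {}
    ~result_memory_accounter() { _limiter.release(_reserved); }

    bool update(size_t bytes) {
        _used += bytes;
        if (_used > _reserved) {
            size_t extra = _used - _reserved;
            if (!_limiter.try_reserve(extra)) {
                return true; // shard-local limit hit
            }
            _reserved += extra;
        }
        return _used >= maximum_result_size; // individual limit hit
    }
};
```

Reserving the minimum up front is what keeps results from being cut off
after only a handful of bytes when the shard is under memory pressure.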
Paweł Dziepak
4c69d7e2fe storage_proxy: clean up after primary_key introduction
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify the patch introducing it, its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes the old row_address alias and removes is_missing_rows() in favour of
the primary_key::less_compare_clustering comparator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dde4bd5051 cql3: allow short reads with paged queries
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
3c173d87b5 storage_proxy: handle intentional short reads
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dd67de7218 storage_proxy: make sure coordinator has complete data
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).

However, this function is called only if reconciled result has at least
as many rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
2ff5308d8e storage_proxy: honour partition limit
At the moment the coordinator does not care much for the partition
limit. In particular it doesn't check whether after reconciliation the
result still contains enough partitions.

This patch makes it honour the partition limit and increase it in the
retried queries if necessary.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
7bed7aa7de storage_proxy: use cmd limits to determine that replica reached end
Coordinator may retry a query with larger limits. However, code
determining whether replica has no more data always used the original
limits. This may cause a livelock.

For example, consider cluster having the following partitions (deletions
cover live cells):

node1:
pk=0, v=0
pk=1, v=1

node2:
delete pk=0
delete pk=1
pk=2, v=2
pk=3, v=3

Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is
going to send partitions 0 and 1 while second node is going to send 2
and 3 + tombstones for 0 and 1. The coordinator will decide that it
needs to retry the request with larger row limit since node1 may have
some information about partitions 2 and 3 that are newer than what node2
has sent.

However, when the second response arrives node1 will still send only two
rows since it has no more data. Because the coordinator uses original
row limit it will not notice that this node reached the end and we are
going to get another retry without making any progress.
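
As a rough illustration of the fix (a hypothetical sketch, not the actual
storage_proxy code; `replica_reached_end` and its parameters are made-up
names), the end-of-data check has to compare against the limit carried by
the command that was actually sent to the replica:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decides whether a replica has run out of data.
//   rows_returned - number of rows in the replica's reply
//   cmd_row_limit - row limit carried by the command that produced this reply
//                   (larger than the original limit when the query was retried)
inline bool replica_reached_end(uint32_t rows_returned, uint32_t cmd_row_limit) {
    assert(cmd_row_limit > 0);
    // Fewer rows than this command asked for: the replica has nothing more.
    return rows_returned < cmd_row_limit;
}
```

In the example above node1 returns 2 rows. Checked against the original
limit of 2 it looks as if node1 may have more data, so the coordinator
retries forever. Checked against a retry limit of, say, 4 (an assumed
value), node1 is recognised as exhausted.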

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
cfd4d0f680 db: add metrics for short reads and memory used for results
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
ba51e7e8db data_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
f1b9f49f2b mutation_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
6c33a4f177 db: create result_memory_accounters when starting query
This patch ensures that when we start executing a query a minimum result
size is reserved from result_memory_limiter.

Moreover, range queries need a way of merging memory usage information
from different shards.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
0bce4047bd query_builder: add partition_slice getter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
15de8de9e5 reconcilable_result: keep result_memory_tracker object
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
34f9eb4cbd mutation_compactor: honour stop_iteration from consumers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
5d7185fd39 db: add result_memory_limiter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
ee89d80d5c query: add result size limiter
This patch introduces an infrastructure for limiting result size.

There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an individual limit which restricts a result to 4 MB.

In order to avoid sending tiny results there is a minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.
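
The three limits can be sketched roughly as follows. This is a simplified,
synchronous illustration under stated assumptions: the class and member
names (`shard_result_limiter`, `result_accounter`, `try_charge`, ...) are
hypothetical, and a real implementation would wait for memory rather than
return false. Only the 10%, 4 MB and 4 kB figures come from the text above.

```cpp
#include <cassert>
#include <cstddef>

// Sizes quoted in the commit message.
constexpr std::size_t minimum_result_size = 4 * 1024;         // 4 kB guaranteed
constexpr std::size_t maximum_result_size = 4 * 1024 * 1024;  // 4 MB per result

// Hypothetical shard-local budget: all results together get 10% of shard memory.
class shard_result_limiter {
    std::size_t _budget;
    std::size_t _used = 0;
public:
    explicit shard_result_limiter(std::size_t shard_memory)
        : _budget(shard_memory / 10) {}

    // Charges `bytes` to the shard budget if they still fit.
    bool try_charge(std::size_t bytes) {
        if (bytes > _budget - _used) {
            return false;
        }
        _used += bytes;
        return true;
    }
    void release(std::size_t bytes) {
        assert(bytes <= _used);
        _used -= bytes;
    }
    std::size_t used() const { return _used; }
};

// Hypothetical per-result accounter.
class result_accounter {
    shard_result_limiter& _limiter;
    std::size_t _size = 0;     // bytes produced so far
    std::size_t _charged = 0;  // bytes charged to the shard budget
public:
    explicit result_accounter(shard_result_limiter& limiter) : _limiter(limiter) {}
    result_accounter(const result_accounter&) = delete;
    result_accounter& operator=(const result_accounter&) = delete;
    ~result_accounter() { _limiter.release(_charged); }

    // Has to succeed before the query starts producing its result.
    bool reserve_minimum() {
        assert(_charged == 0);
        if (!_limiter.try_charge(minimum_result_size)) {
            return false;
        }
        _charged = minimum_result_size;
        return true;
    }

    // Returns false when the result has to stop growing (a short read).
    bool add(std::size_t bytes) {
        const std::size_t new_size = _size + bytes;
        if (new_size > maximum_result_size) {
            return false;  // individual limit reached
        }
        if (new_size > _charged) {
            const std::size_t extra = new_size - _charged;
            if (!_limiter.try_charge(extra)) {
                return false;  // shard-local limit reached
            }
            _charged += extra;
        }
        _size = new_size;
        return true;
    }
};
```

Growth inside the reserved 4 kB never touches the shard budget again, which
is what guarantees that a started query can always return a non-tiny result.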

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
43fe3439ca reconcilable_result: properly propagate short_read flag
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
837d24f1b2 query_pagers: handle short reads properly
Currently, the paging implementation assumes that the server returns
either as many rows as it was asked for or has reached the end. Soon,
that's not going to be true so instead of making any assumptions about
the number of the rows returned use the new "short read" flag to
determine whether there is going to be more data.
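
The difference between the two checks can be sketched as below. This is a
hypothetical illustration, not the query_pagers code: `page_info` and both
function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one page handed to the pager.
struct page_info {
    uint32_t row_count;  // rows in this page
    bool short_read;     // set when the server stopped early but has more data
};

// Old assumption: a page with fewer rows than requested is the last one.
inline bool is_last_page_by_count(const page_info& p, uint32_t page_size) {
    assert(page_size > 0);
    return p.row_count < page_size;
}

// With short reads allowed: a short page ends the query only if the
// short_read flag is not set.
inline bool is_last_page(const page_info& p, uint32_t page_size) {
    assert(page_size > 0);
    return !p.short_read && p.row_count < page_size;
}
```

For a page of 10 rows with the flag set and a page size of 100, the old
check would stop paging too early while the new one keeps going.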

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
da7ca85040 query: allow short reads
When paging is used the cluster is allowed to return fewer rows than the
client asked for. However, if such a possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to send and short reads caused by any other means.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Paweł Dziepak
7a15c89b1d serializer_impl: add serializer for bool_class<Tag>
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Takuya ASADA
8918a4be57 dist/common/scripts/scylla_setup: don't abort scylla_setup when each setup script failed
Instead of aborting scylla_setup, print a warning message and continue to the next setup script.

Fixes #1357

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481713664-18429-1-git-send-email-syuu@scylladb.com>
2016-12-14 13:31:50 +02:00
Tomasz Grabiec
c9344826e9 tests: Remove unintentional enablement of trace-level logging
Sneaked in by mistake.
2016-12-14 10:58:07 +01:00
Tomasz Grabiec
fe6a70dba1 tests: commitlog: Fix assumption about write visibility
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and new segment may be
active while the previous one is still being written or not yet
synced.

Fix the test so that it expects that the number of mutations read
this way is <= the number of mutations read, and that after all
segments are synced, the number of mutations read is equal.

Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
2016-12-14 11:29:33 +02:00
Avi Kivity
a61ff53150 Merge "rework flush criteria" from Glauber
"The current criteria for memtable flush is not being respected.  The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.

More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.

To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.

Fixes #1918"

* 'flush-criteria-v6' of github.com:glommer/scylla:
  config: get rid of memtable_total_space
  database: rework dirty memory hierarchy
  system keyspace: write batchlog mutation in user memory
  database: remove flush_token
  database: abstract pressure condition notification
  database: encapsulate semaphore_units into a flush_permit
  database: remove friendship declaration
  database: simplify flush_one
  database: make memtable_list aware in cases it can't flush
2016-12-14 11:24:10 +02:00
Takuya ASADA
c18a95cddf dist/redhat: add scylla_lib.sh to scylla.spec
Fix .rpm build error.

Fixes #1932

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481703992-9596-1-git-send-email-syuu@scylladb.com>
2016-12-14 10:27:37 +02:00
Glauber Costa
56df53f51e compaction_manager: fix shutdown sequence
By the time we are able to acquire this semaphore, we may be stopped
already. So we need to test it before we go ahead. I can see shutdown
hangs before this patch that are fixed with it applied.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <e5b378893128d086d584ffbb2acd3fb687648e5c.1481655433.git.glauber@scylladb.com>
2016-12-14 09:26:24 +01:00
Asias He
84fa2c91c7 repair: Reduce unnecessary streaming traffic even more
In 7c873f0d (repair: Reduce unnecessary streaming traffic), we optimized
the cases when 1) all the remote nodes have the same checksum and 2) the local node
has zero checksum.

In this patch, we make the optimization more generic and cover more cases.

1) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node2

Before:
INFO  2016-12-09 16:24:31,961 [shard 0] repair - Found differing range (-4091524285777924069, -4086237930244473115]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO  2016-12-09 16:24:31,963 [shard 0] repair - Found differing range (-609511120964672970, -605253169726090861]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}
INFO  2016-12-09 16:24:31,964 [shard 0] repair - Found differing range (-7655412157560911259, -7652234653747163387]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO  2016-12-09 16:24:31,965 [shard 0] repair - Found differing range (-4133815130045531703, -4128528774512080749]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO  2016-12-09 16:24:31,967 [shard 0] repair - Found differing range (-605253169726090861, -600995218487508751]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}
INFO  2016-12-09 16:24:31,968 [shard 0] repair - Found differing range (438510347741343837, 441475345714861354]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}

After:
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-660606535827658284, -656348584589076175]
on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4234255885181099833, -4228969529647648879]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4228969529647648879, -4223683174114197925]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4223683174114197925, -4218396818580746971]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-7728494745277112315, -7725317241463364443]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-720217853167807818, -715959901929225709]
on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3}

Before, we needed to fetch data from both node 1 and node 3 and send data back to node 1 and node 3, i.e., 2 IN, 2 OUT.

After, we only need to send data to node 3, i.e., 0 IN, 1 OUT.

We saved 3X traffic; with higher RF, we can save even more.

2) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node3

Before:
INFO  2016-12-09 16:20:11,448 [shard 0] repair - Found differing range (-8533861887892628919, -8052600134279395253]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:20:11,465 [shard 0] repair - Found differing range (7190719703944308372, 7692358524564683543]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:20:11,486 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:20:11,494 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:20:11,518 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:20:11,519 [shard 0] repair - Found differing range (-1137497074548854552, -592479316010344531]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}

After:
INFO  2016-12-09 16:29:22,433 [shard 0] repair - Found differing range (67885601051654285, 447405341661896387]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,454 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,473 [shard 0] repair - Found differing range (2523396860109747637, 3083778975065200884]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,474 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:29:22,487 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,493 [shard 0] repair - Found differing range (-1137497074548854552, -592479316010344531]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}

This shows that the new, more generic method covers the optimization we had before as well.
2016-12-14 09:37:35 +08:00
Asias He
bd1cd53b2a repair: Add hash specialization for partition_checksum
So we can store partition_checksum in std::unordered_map as a key.
2016-12-14 09:33:16 +08:00
Glauber Costa
2aa6514667 config: get rid of memtable_total_space
Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 17:05:12 -05:00
Glauber Costa
80440c0d79 database: rework dirty memory hierarchy
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.

That happens because group sizes (and limits) for pressure purposes, and
the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.

This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have to acquire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors.

To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happens we will draw some memory from the cache, but we will
live with it for the time being.

Fixes #1918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 14:07:53 -05:00
Glauber Costa
db7cc3cba8 system keyspace: write batchlog mutation in user memory
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:35 -05:00
Glauber Costa
be9e4c71ad database: remove flush_token
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
98030ad66c database: abstract pressure condition notification
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
c9a8b03311 database: encapsulate semaphore_units into a flush_permit
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.

Preparation patch for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
2e8c7d2c62 database: remove friendship declaration
Not needed anymore since memtable started having a direct pointer to the
memtable list.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
bb1509c21e database: simplify flush_one
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different from the one in which the flush request originated.

It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
8ab7c04caa database: make memtable_list aware in cases it can't flush
Some of our CFs can't be flushed. Those are the ones that are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.

We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.

It's easier, simpler, and cleaner to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function.
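
A minimal sketch of that idea, assuming a synchronous seal function for
brevity (the class name `memtable_list_sketch` and its methods are
hypothetical and only illustrate the empty std::function check):

```cpp
#include <cassert>
#include <functional>
#include <utility>

class memtable_list_sketch {
    std::function<void()> _seal_fn;  // left empty when the list cannot flush
public:
    // Flushable list: the caller supplies the seal function.
    explicit memtable_list_sketch(std::function<void()> seal_fn)
        : _seal_fn(std::move(seal_fn)) {
        assert(_seal_fn);  // a flushable list needs a real seal function
    }

    // Non-flushable list (no durable writes): no seal function at all.
    memtable_list_sketch() = default;

    // An empty std::function converts to false.
    bool may_flush() const { return static_cast<bool>(_seal_fn); }

    // Returns true if a flush was actually started.
    bool request_flush() {
        if (!may_flush()) {
            return false;
        }
        _seal_fn();
        return true;
    }
};
```

The flush logic can then skip lists for which may_flush() is false instead
of calling a do-nothing function and treating it as a completed flush.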

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Takuya ASADA
0a6312d254 dist/common/scripts/scylla_ntp_setup: fix incorrect usage of is_debian_variant
Use it as "if is_debian_variant; then".
Fixes #1931

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481644262-29383-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:42 +02:00
Takuya ASADA
ed4cd1908f dist/common/scripts/scylla_selinux_setup: correct CentOS/RHEL detection
CentOS/RHEL is using SELinux, and it's NOT Debian variant, so fixed from
"is_debian_variant" to "! is_debian_variant".

Fixes #1930

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643873-28984-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:29 +02:00
Takuya ASADA
6c0dc55495 dist/common/scripts/scylla_selinux_setup: to use is_debian_variant(), need to source /usr/lib/scylla/scylla_lib.sh
This fixes following command not found error:
```
/usr/sbin/scylla_selinux_setup: line 7: is_debian_variant: command not found
```

Fixes #1929

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643308-28637-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:13 +02:00
Takuya ASADA
3b74c50546 dist/ubuntu: add uuidgen to package dependency
We haven't added uuidgen to Ubuntu/Debian package dependency, so scylla_setup
script may abort because of command not found.

Fixes #1928

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481642385-27941-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:28:48 +02:00
Duarte Nunes
1e75a4950e database: Complete query when hitting partition limit
Previously, we weren't completing a query as early as possible if it
reached the partition limit; we instead had to wait until reaching the
end of the specified partition ranges. This patch fixes that by
including a check of the partition limit in the termination condition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>

Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
2016-12-13 14:53:46 +02:00
Tomasz Grabiec
f451014785 schema: Implement operator<< for column_mapping
Message-Id: <1481310679-14074-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:20:46 +02:00
Tomasz Grabiec
059a1a4f22 db: Fix commitlog replay to not drop cell mutations with older schema
column_mapping is not safe to access across shards, because data_type
is not safe to access. One manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.

During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.

Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.

Fixes #1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:19:32 +02:00
Avi Kivity
32d55bbb4c Merge seastar upstream
* seastar 0773e98...6fbd792 (2):
  > tls: Only run our "verify" function in client session
  > Merge "Clean the metric definition" from Amnon

Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
2016-12-13 12:17:14 +02:00
Avi Kivity
6f9c317b91 Merge "Use uuid file in housekeeping" from Amnon
"This patch adds the use of a uuid file to the housekeeping daily version check.
The uuid file is optional; if the file is missing, no uuid will be used."
2016-12-13 10:52:44 +02:00
Avi Kivity
c67782f169 Merge seastar upstream
* seastar 0a74317...0773e98 (6):
  > tls: Add support for client cetrificate verification & priority strings
  > semaphore: add consume_units
  > semaphore: add available_units()
  > thread: check need_preempt for threads in a scheduling group as well
  > tutorial: fix semaphore example, and text
  > stop_iteration: add && and || operators
2016-12-12 18:06:19 +02:00
Avi Kivity
c801cc4bd1 Merge "streaming and repair updates" from Asias
"This series:
- We can make reader with ranges
- Fix possible use after free of 'si'
- Streaming ranges now are sorted and merged
- Fix shard_begin shard_end end loop in both streaming and repair"
2016-12-12 11:32:42 +02:00
Asias He
ba54654af3 streaming: Use interval_set to sort and merge ranges
So that the ranges are sorted and have no overlaps. We can have fewer
ranges to deal with and it can help the mutation readers to optimize.

Here is an example of ranges generated by repair:

Before:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    before ranges = {(-3383928698815274642, -3376937163195039606],
    (-7260764223708720005, -7251657821052234309], (-4767213984179237293,
    -4747032371925842389], (-7645879646119667643, -7589962743703481776],
    (-2340199306656526861, -2320523117224780931], (-576028861239229331,
    -560973674020019962], (-4070378863644120252, -3987599893827407860],
    (-2551584407739673151, -2498779102482524711], (-5416061903556353312,
    -5354212455975869358], (37594980457713898, 67885601051654285],
    (3083778975065200884, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (5765437476661317163, 5778671293583720541],
    (5960610072466058818, 5972289771228014343], (7749618183851698485,
    7758080813117351135], (-3987599893827407860, -3899198931034439776],
    (-7251657821052234309, -7131649010279865221], (-3576581915808403133,
    -3383928698815274642], (-417850207760366422, -327959672080599465],
    (-2671876682129336880, -2551584407739673151], (-1305178847032904465,
    -1137497074548854552], (8540448858050275827, 8610171849752115483],
    (-560973674020019962, -417850207760366422], (-2498779102482524711,
    -2340199306656526861], (2394447940525988167, 2523396860109747637],
    (-6703329224557608009, -6517757811218772762], (-3675103288021821677,
    -3576581915808403133], (-5622185785296846551, -5416061903556353312],
    (8610171849752115483, 8742605005068551458], (8068079250973315241,
    8185655671734937642], (560264964510741191, 790641981923757238],
    (5581202487214475094, 5765437476661317163], (8742605005068551458,
    8923908282731801645], (-6038176423022601107, -5622185785296846551],
    (5778671293583720541, 5960610072466058818], (-3899198931034439776,
    -3675103288021821677], (8356739976149429222, 8540448858050275827],
    (-6517757811218772762, -6038176423022601107], (-8052600134279395253,
    -7645879646119667643], (-327959672080599465, 37594980457713898],
    (7758080813117351135, 8019254284118543066], (4781565016737645510,
    5067070718000527886], (2523396860109747637, 3083778975065200884],
    (-5354212455975869358, -4767213984179237293], (6784138025918878582,
    7190719703944308372], (67885601051654285, 447405341661896387],
    (-2190610927722759275, -1305178847032904465], (-4747032371925842389,
    -4070378863644120252]}, size=48

After:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    after  ranges = {(-8052600134279395253, -7589962743703481776],
    (-7260764223708720005, -7131649010279865221], (-6703329224557608009,
    -3376937163195039606], (-2671876682129336880, -2320523117224780931],
    (-2190610927722759275, -1137497074548854552], (-576028861239229331,
    447405341661896387], (560264964510741191, 790641981923757238],
    (2394447940525988167, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (4781565016737645510, 5067070718000527886],
    (5581202487214475094, 5972289771228014343], (6784138025918878582,
    7190719703944308372], (7749618183851698485, 8019254284118543066],
    (8068079250973315241, 8185655671734937642], (8356739976149429222,
    8923908282731801645]}, size=15
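
The effect shown in the log can be reproduced with a plain sort-and-merge
over half-open (start, end] ranges. The sketch below uses only the standard
library as a stand-in for the interval_set named in the subject; `token_range`
and `sort_and_merge` are hypothetical names, tokens are reduced to int64_t,
and wrap-around ranges are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// (start, end] with start < end.
using token_range = std::pair<int64_t, int64_t>;

std::vector<token_range> sort_and_merge(std::vector<token_range> ranges) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<token_range> merged;
    for (const auto& r : ranges) {
        assert(r.first < r.second);
        // (a, b] and (b, c] touch, so they join into (a, c].
        if (!merged.empty() && r.first <= merged.back().second) {
            merged.back().second = std::max(merged.back().second, r.second);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}
```

For instance, (5, 7], (1, 3], (3, 5] and (10, 12] collapse into (1, 7] and
(10, 12], the same way the 48 ranges above collapse into 15.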
2016-12-12 11:09:26 +08:00
Asias He
e523803a5d token_metadata: Introduce interval_to_range helper
It is used to convert a boost::icl::interval<token> interval back to a
range<token>.
2016-12-12 11:09:26 +08:00
Asias He
af3d76e6ac repair: Fix a typo in the log
sucessfully -> successfully
2016-12-12 11:09:26 +08:00
Asias He
374324e6fb repair: Fix shard_begin and shard_end
A range now alternates between different shards: the first part of the
range goes to shard X, the next to shard X+1, but after a while we go
back to shard X. So we can't do a simple loop between shard_begin and
shard_end.

Fix by using the newly introduced dht::split_range_to_shards

Use the cf.make_streaming_reader with ranges to simplify the code a bit.
2016-12-12 11:09:26 +08:00
Asias He
1987264beb streaming: Make streaming reader with ranges
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.

1) Fewer readers are needed
before: number of ranges of readers
after: smp::count readers at most

2) No foreign_ptr is needed
There is no need to forward to a shard to make the foreign_ptr for
send_info in the first phase and forward to that shard to execute the
send_info in the second phase.

3) No do_with is needed in send_mutations since si now is a
lw_shared_ptr

4) Fix possible use after free of 'si' in do_send_mutations
We need to take a reference of 'si' when sending the mutation with
send_stream_mutation rpc call, otherwise:
   msg1 got exception
   si->mutations_done.broken()
   si is freed
   msg2 got exception
   si is used again
The issue was introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts) which is master only, branch 1.5 is not
affected.
2016-12-12 09:04:21 +08:00
Asias He
463cc4fbde dht: Introduce split_ranges_to_shards
Split ranges into a shard ranges map with ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
044c4ff44c dht: Introduce split_range_to_shards
Split a range into a shard ranges map with ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
cd2105b8bd database: make_streaming_reader for ranges
Allow to make a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in a following streaming patch.

We can make the reader more efficient later.
2016-12-12 09:04:21 +08:00
Duarte Nunes
ada2f1092e dht: Make i_partitioner::tri_compare pure virtual
This patch makes the i_partitioner::tri_compare() function pure
virtual as it is overridden by all partitioners.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211172037.16496-1-duarte@scylladb.com>
2016-12-11 19:29:37 +02:00
Duarte Nunes
bb66b051ed dht: Make i_partitioner::tri_compare memory safe
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, thus avoiding accessing random
memory and getting random results.
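
To illustrate why the choice matters (a hypothetical byte-wise comparison,
not the actual i_partitioner code; `tri_compare_bytes` is an invented name):
only the common prefix of two buffers may be passed to memcmp, and that
length is the minimum of the two sizes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns a negative, zero or positive value, like memcmp.
int tri_compare_bytes(const uint8_t* a, std::size_t a_len,
                      const uint8_t* b, std::size_t b_len) {
    assert((a != nullptr || a_len == 0) && (b != nullptr || b_len == 0));
    // With std::max here, memcmp would read past the end of the shorter buffer.
    const std::size_t common = std::min(a_len, b_len);
    const int r = common > 0 ? std::memcmp(a, b, common) : 0;
    if (r != 0) {
        return r;
    }
    // Equal common prefix: the shorter buffer sorts first.
    return a_len < b_len ? -1 : (a_len > b_len ? 1 : 0);
}
```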

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
2016-12-11 18:58:10 +02:00
Amnon Heiman
08dcd8cb4a scylla housekeeping ubuntu service: use uuid file
This patch adds uuid file support for Ubuntu systems. It also splits the
behaviour between restart and daily checks. The first runs in r mode and
the second in d mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:35:07 +02:00
Amnon Heiman
6fef24aaf0 housekeeping systemd service: use uuid file
This sets the housekeeping systemd service to use a uuid file and use
daily mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:02:16 +02:00
Amnon Heiman
17b8306bc4 scylla-housekeeping support uuid file
Allows scylla-housekeeping to get the uuid from a file instead of the
command line.

If the file is missing no uuid will be used.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:00:34 +02:00
Avi Kivity
299d1fad0b Merge "reduce bloom filter overhead in compaction" from Raphael
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."

* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
  compaction: reduce bloom filter overhead with incremental selector
  tests: add test for sstable set's incremental selector
  sstable_set: introduce incremental selector
  compatible_ring_position: add function to return token
2016-12-11 09:46:58 +02:00
Glauber Costa
5803957ab5 compaction: fix build
Commit 732ee275 moved tracking of one statistics value inside a lambda
without capturing this in that lambda. Compilation fails as a result.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <68860640f4533dd43e43f341f1620e25464b700b.1481313455.git.glauber@scylladb.com>
2016-12-10 09:00:20 +02:00
Raphael S. Carvalho
fcfc84e836 compaction: reduce bloom filter overhead with incremental selector
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.

Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.

Fixes #1322.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
548f6066c5 tests: add test for sstable set's incremental selector
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
02541e15c1 sstable_set: introduce incremental selector
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to current interval. For other strategies, it just returns
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:16 -02:00
Glauber Costa
9b5e6d6bd8 commitlog: correctly report requests blocked
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.

That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.

This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
2016-12-09 15:02:26 +02:00
Raphael S. Carvalho
732ee275f8 compaction: fix running compaction counter when splitting sstables
The counter was being increased before taking the semaphore, so
every pending split would count as a running compaction which
misleads the user as a result.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <f2050cc3599cee7af29d4579368a154708b37731.1481248048.git.raphaelsc@scylladb.com>
2016-12-09 15:01:43 +02:00
Raphael S. Carvalho
453620a316 compatible_ring_position: add function to return token
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-08 14:25:29 -02:00
Avi Kivity
872b5ef5f0 sstables: fix probe with Unknown component
Commit 53b7b7def3 ("sstables: handle unrecognized sstable component")
ignores unrecognized components, but misses one code path during probe_file().

Ignore unrecognized components there too.

Fixes #1922.
Message-Id: <20161208131027.28939-1-avi@scylladb.com>
2016-12-08 15:24:25 +01:00
Glauber Costa
733d87fcc6 database: try to acquire semaphore before we start flush
As Tomek pointed out, as we are starting the flush before we acquire the
semaphore, we are not really limiting parallelism, but only delaying the
end of the flush instead.

Fixes #1919

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>
2016-12-08 12:18:32 +01:00
Tomasz Grabiec
3511bf4a81 Merge branch 'tgrabiec/memtable-gentle-clearing' from seastar-dev.git
When row cache is disabled, update_cache() will do nothing to the
memtable. Active readers may keep the memtable alive for unbounded
amount of time, preventing it from going away. This doesn't play well
with virtual dirty accounting. Soon before calling update_cache(), the
memory which was subtracted during flush is added back to the amount
of virtual dirty memory. If there was write pressure all along, we
will be at the dirty memory limit. When we give back subtracted memory
this will put virtual dirty way above the limit. This will stall all
writes until another memtable flush drags virtual dirty down or
readers finally release the memtable. We want to prevent upward
jumps of virtual dirty.

First part of the fix is to ensure that as long as the memtable's
region is in the dirty group, we will not revert flushed memory. This
must happen synchronously from region's memory being removed from the
group in order to prevent upward virtual dirty jumps. To make this
easier, tracking of flushed memory was moved to the memtable object.

Another part of the fix is to gradually clear the memtable when cache
is disabled in a similar fashion as when it's moved to cache. This
ensures that the actual memory held by memtable's region is released
sooner than it dies.

Refs #1879
2016-12-08 12:18:32 +01:00
Gleb Natapov
a05516f14c storage_proxy: wire up range_slice_timeouts, range_slice_unavailables and read_unavailables counters
Message-Id: <20161206105154.GL1866@scylladb.com>
2016-12-08 11:42:52 +02:00
Avi Kivity
5530a61975 sstables: fix build with older boost (boost::variant::get<T&>)
Older boost doesn't support boost::variant::get<T&> (where the type
parameter is reference qualified); remove (unneeded anyway).
2016-12-08 10:56:05 +02:00
Pekka Enberg
0bc3ce7e09 Merge "sstables: remove sharding metadata from Statistics component" from Avi
"Due to my misreading of Cassandra code, I thought it would ignore new
components in the Statistics component; however, it doesn't, and the change
(introduced in bdd11648ac ("sstables: add
intra-node sharding metadata")) breaks sstable2json and likely any
Cassandra code that touches sstables.

To fix, move the sharding data into a new component ("Scylla.db"), which
Cassandra does ignore.  The new component is designed to be extensible so
we don't experience the same issue later on."
2016-12-08 10:14:07 +02:00
Avi Kivity
7f26f9c0f9 Merge "repair refactor and fix" from Asias
* tag 'asias/repair/subranges/refactor_fix/v1' of github.com:cloudius-systems/seastar-dev:
  repair: Limit the number of sub ranges
  repair: Use estimated_keys_for_range in repair_cf_range
  repair: Extract the target_partitions into repair_info class
  repair: Put request_transfer_ranges into repair_info class
  repair: Introduce check_failed_ranges helper
  repair: Introduce do_streaming helper
  repair: Make the neighbors const reference
  repair: Introduce repair_info
  repair: Attach the repair id in the stream plan name
2016-12-08 10:06:39 +02:00
Tomasz Grabiec
f7197dabf8 commitlog: Fix replay to not delete dirty segments
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.

The fix is to record preexisting segments before any new segments will
have a chance to be created and use that as the replay list.
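
The ordering can be sketched as follows. This is a simplified model with an in-memory "disk"; fake_disk, commitlog_sketch and the segment file names are hypothetical and do not reflect the real commitlog API.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the commitlog directory.
struct fake_disk {
    std::set<std::string> files;
};

class commitlog_sketch {
    fake_disk& _disk;
    std::vector<std::string> _preexisting;  // snapshot taken before any allocation
    int _next_id = 0;
public:
    explicit commitlog_sketch(fake_disk& d)
        : _disk(d), _preexisting(d.files.begin(), d.files.end()) {}

    // Creates a new segment (for example for the reserve) after startup.
    std::string allocate_segment() {
        std::string name = "new-" + std::to_string(_next_id++) + ".log";
        _disk.files.insert(name);
        return name;
    }

    const std::vector<std::string>& segments_to_replay() const {
        return _preexisting;
    }

    // Replay works from the snapshot, so segments created since boot survive.
    void replay_and_unlink() {
        for (const auto& f : _preexisting) {
            _disk.files.erase(f);
        }
    }
};
```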

Introduced in abe7358767.

dtest failure:

 commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup

Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
2016-12-07 15:54:47 +02:00
Avi Kivity
4fedbf8430 Merge "service::storage_proxy: rework collectd counters registration" from Vlad
- Add "coordinator" and "replica" categories
   - Use a new seastar/metrics_registration framework

* 'rearrange-storage-proxy-stats-v4' of github.com:cloudius-systems/seastar-dev:
  service::storage_proxy: rework the collectd counters registration
  service/storage_proxy: regroup collectd statistics
2016-12-07 15:38:40 +02:00
Avi Kivity
3c3a18f222 sstables: move sharding metadata from Statistics component to a new Scylla component
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
2016-12-07 15:20:13 +02:00
Avi Kivity
24140ec8c6 sstables: add support for sets of discriminated union types
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the discriminant as the key.  Parsers and writers
are auto-generated.
2016-12-07 13:27:52 +02:00
Avi Kivity
e0cce9d299 Merge "streaming: Improve logging" from Asias
"This series adds streaming bandwidth and streaming plan name to the log when
streaming is finished."
2016-12-07 12:21:47 +02:00
Amos Kong
f32f7993cc systemd: reset housekeeping timer at each start
Currently the housekeeping timer is not reset when we restart scylla-server.
We expect the service to run at each start, which is consistent with
the upstart script in Ubuntu 14.04.

When we restart scylla-server, the housekeeping timer is also restarted,
so let's replace "OnBootSec" with "OnActiveSec".

Fixes: #1601

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <a22943cc11a3de23db266c52fd476c08014098c4.1480607401.git.amos@scylladb.com>
2016-12-06 18:33:37 +02:00
Takuya ASADA
5a5ab51254 dist/ubuntu/dep: fix incorrect file path to detect previously built .deb existence check
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-4-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
6dd6b868a6 scripts/scylla_install_pkg: support Debian
Support Debian in the installation script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-3-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
7f2df8f86e dist/common/scripts: introduce scylla_lib.sh
To reduce duplicated code and simplify the scripts, introduce scylla_lib.sh
for shell scripts. It provides functions to classify distributions
and to load all sysconfig files.

This also fixes script bugs that misdetected Debian and RHEL.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-2-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
8464903021 dist/common/systemd/scylla-housekeeping.timer: workaround to avoid crash of systemd on RHEL 7.3
RHEL 7.3's systemd contains a known bug in timer.c:
https://github.com/systemd/systemd/issues/2632

This is a workaround to avoid hitting the bug.

Fixes #1846

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480452194-11683-1-git-send-email-syuu@scylladb.com>
2016-12-06 10:48:28 +02:00
Takuya ASADA
b2c0059da3 dist/common/scripts/scylla_coredump_setup: use systemd-coredump on Ubuntu 16.04
Ubuntu 16.04 has systemd-coredump, better to use it.

Fixes #1916

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480679267-30844-1-git-send-email-syuu@scylladb.com>
2016-12-05 17:09:38 +02:00
Takuya ASADA
2976799ef2 main: fix startup failing on Ubuntu 15.10/16.04
Since Ubuntu 15.10/16.04 still uses Upstart to manage the GUI session (not as init), when we launch Scylla directly from Ubuntu's GUI terminal (not using systemctl or initctl), raise(SIGSTOP) is mistakenly called, because the GUI session sets the "UPSTART_JOB" environment variable. This does not happen when running Scylla as a systemd service.

To avoid this, we need to verify UPSTART_JOB == "scylla-server".
If it's part of GUI session UPSTART_JOB has to be "unity7", we need to avoid raise(SIGSTOP) in that case.
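
A sketch of the check described above. The helper names are hypothetical, and the call site in main is not shown in this message, so the wiring is an assumption; only the comparison against "scylla-server" comes from the text.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <cstring>

// True only when Upstart started us as the scylla-server job. A GUI session
// also sets UPSTART_JOB (for example to "unity7"), which must not match.
bool launched_as_scylla_upstart_job(const char* upstart_job) {
    return upstart_job != nullptr
        && std::strcmp(upstart_job, "scylla-server") == 0;
}

// Hypothetical call site: stop ourselves only for the real Upstart job.
void signal_upstart_if_needed() {
    if (launched_as_scylla_upstart_job(std::getenv("UPSTART_JOB"))) {
        std::raise(SIGSTOP);
    }
}
```

Comparing the value, rather than only testing that the variable is set, is what keeps a terminal inside an Upstart-managed desktop session from stopping the process.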

Fixes #1199

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480620421-28967-1-git-send-email-syuu@scylladb.com>
2016-12-05 16:28:25 +02:00
Tomasz Grabiec
527ff6aa40 db: Clear memtable after flush when cache is disabled
So that memory is released gradually (impacting latency less) and
sooner than when memtable is destroyed. Active readers may keep the
memtable alive for unbounded amount of time.

Refs #1879
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1bba51319e memtable: Maintain virtual dirty on clear()
When memtable is flushing, it subtracts _flushed_memory from group's
size to gradually allow more writes. Ideally _flushed_memory would be
equal to region's size when flush ends, so the group's size would
reach zero. When the memtable and its region are gone the group size
should remain the same as after the flush. This is ensured by adding
back _flushed_memory to group's size right before the region is
removed from the group.

Calling clear() before region is removed from the group breaks the
accounting because it will shrink the region, but will not affect the
amount of memory subtracted due to _flushed_memory. So group's size
would decrease more than we want (twice the region's size). The fix is
to change clear() so that it reverts _flushed_memory by the amount by
which the region size is reduced. This will keep the group's size
constant as long as _flushed_memory > 0.
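
The arithmetic can be checked with a small model. This is a sketch under the assumption that the group's size contribution of one memtable is its region size minus _flushed_memory; memtable_accounting and its methods are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model of one memtable's contribution to the group's size.
struct memtable_accounting {
    int64_t region_size;
    int64_t flushed_memory = 0;

    explicit memtable_accounting(int64_t size) : region_size(size) {}

    // What the group sees: real region memory minus what flush subtracted.
    int64_t group_contribution() const { return region_size - flushed_memory; }

    void mark_flushed(int64_t bytes) { flushed_memory += bytes; }

    // Broken variant: shrinks the region but leaves flushed_memory alone.
    void clear_naive(int64_t bytes) { region_size -= bytes; }

    // Fixed variant: reverts flushed_memory by the amount the region shrank.
    void clear_fixed(int64_t bytes) {
        region_size -= bytes;
        flushed_memory -= std::min(bytes, flushed_memory);
    }
};
```

For a fully flushed 100-byte region, clearing 40 bytes with the fixed variant leaves the contribution at 0, while the naive variant drives it to -40.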
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1b5f338c17 memtable: Track flushed memory in memtable object 2016-12-05 12:59:09 +01:00
Tomasz Grabiec
c3768fe4de memtable: Pass dirty_memory_manager& to memtable constructor
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:

  boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group);

This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
2016-12-05 12:59:09 +01:00
Asias He
00d7a35949 utils: Put crc32 under utils namespace
It conflicts with crc in zlib
Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>
2016-12-05 11:48:29 +02:00
Takuya ASADA
54ea0055fc dist/common/scripts/node_exporter_install: use curl instead of wget
Minimal installations of CentOS/Ubuntu include curl but not wget,
and we already have a dependency on curl, so we should switch to curl.

Fixes #1902

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480929047-2347-1-git-send-email-syuu@scylladb.com>
2016-12-05 11:26:36 +02:00
Asias He
86c2620b7a gossip: Skip stopping if it is not started
If exception is triggered early in boot when doing an I/O operation,
scylla will fail because io checker calls storage service to stop
transport services, and not all of them were initialized yet.

Scylla was failing as follows:
scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local()
[with Service = gms::gossiper]: Assertion `local_is_initialized()' failed.
Aborting on shard 0.
Backtrace:
  0x000000000048a2ca
  0x000000000048a3d3
  0x00007fc279e739ff
  0x00007fc279ad6a27
  0x00007fc279ad8629
  0x00007fc279acf226
  0x00007fc279acf2d1
  0x0000000000c145f8
  0x000000000110d1bc
  0x000000000041bacd
  0x00000000005520f1
  0x00007fc279aeaf1f
Aborted (core dumped)

Refs #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>
2016-12-05 09:42:37 +02:00
Asias He
49229964d0 streaming: Add streaming plan name when session is failed
Before:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed, peers={127.0.0.1, 127.0.0.2}

After:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed for streaming plan repair-in-29, peers={127.0.0.1, 127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
1c47e26913 streaming: Add streaming plan name when all sessions are completed
Before:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed, peers={127.0.0.2}

After:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed for streaming plan repair-in-32, peers={127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
984f427cb5 streaming: Log streaming bandwidth
It looks like:

[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.1 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.2 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.3 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] bytes_sent = 393284364, bytes_received = 0, tx_bandwidth = 17.048 MiB/s, rx_bandwidth = 0.000 MiB/s
[Stream #f3907fd0-a557-11e6-a583-000000000000] All sessions completed, peers={127.0.0.1, 127.0.0.2, 127.0.0.3}

Fixes #1826
2016-12-05 08:20:18 +08:00
Asias He
4ae5781e40 repair: Limit the number of sub ranges
A range is divided into N sub ranges so that each sub range contains 100
partitions. So N depends on the number of partitions in that range.  N
can grow unbounded, and so can the memory usage of the vector that holds
these sub ranges.

Limit the max number of sub ranges a range can be divided into.

The downside is that limiting the number of sub ranges makes us include more
partitions in the checksum.
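
A sketch of the computation, assuming the estimated partition count is already known. The function names and the cap of 1024 used in the example are hypothetical; the message only states the target of 100 partitions per sub range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Number of sub ranges: enough for target_partitions each, but capped.
uint64_t sub_range_count(uint64_t estimated_partitions,
                         uint64_t target_partitions,
                         uint64_t max_sub_ranges) {
    assert(target_partitions > 0 && max_sub_ranges > 0);
    // Ceiling division without overflowing on large estimates.
    uint64_t n = estimated_partitions / target_partitions
               + (estimated_partitions % target_partitions != 0 ? 1 : 0);
    n = std::max<uint64_t>(n, 1);
    return std::min(n, max_sub_ranges);
}

// Partitions covered by each checksum once the range is split n ways.
uint64_t partitions_per_sub_range(uint64_t estimated_partitions, uint64_t n) {
    assert(n > 0);
    return estimated_partitions / n + (estimated_partitions % n != 0 ? 1 : 0);
}
```

With a hypothetical cap of 1024, a range of 1,000,000 partitions is split into 1024 sub ranges of about 977 partitions each instead of 10,000 sub ranges of 100.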

Fixes #1917
2016-12-05 08:12:48 +08:00
Asias He
d850b86145 repair: Use estimated_keys_for_range in repair_cf_range
Use the newly introduced interface to estimate number of partitions in
the range.
2016-12-05 08:05:07 +08:00
Asias He
7b63cbbe0d repair: Extract the target_partitions into repair_info class
We can tune the number on a per repair basis.
2016-12-05 08:05:07 +08:00
Asias He
d9b689321e repair: Put request_transfer_ranges into repair_info class 2016-12-05 08:05:07 +08:00
Asias He
7741393059 repair: Introduce check_failed_ranges helper
To check whether there are any failed ranges and log them.
2016-12-05 08:05:07 +08:00
Asias He
f8d7aa597b repair: Introduce do_streaming helper
To execute the stream_plans to sync data between nodes.
2016-12-05 08:05:07 +08:00
Asias He
d0a6290d4f repair: Make the neighbors const reference
We do not modify it. Make it const reference.
2016-12-05 08:05:07 +08:00
Asias He
6d0f6c1a99 repair: Introduce repair_info
To reduce the number of parameters we pass around. Simplify the code a
little bit.
2016-12-05 08:05:06 +08:00
Asias He
9be5170c07 repair: Attach the repair id in the stream plan name
So that we know which repair id this stream plan belongs to.
2016-12-05 08:05:06 +08:00
Tomasz Grabiec
d496dfeced Update seastar submodule
* seastar 7790e68...0a74317 (2):
  > core/reactor: Move definitions out of #ifndef
  > Add systemtap-sdt-devel to fedora dependencies

Fixes #1915.
2016-12-02 10:49:17 +01:00
Vlad Zolotarov
e5e7ac1bd4 service::storage_proxy: rework the collectd counters registration
Use the new seastar's metrics_registration framework:
   - Change the registration syntax.
   - Add a long description for each counter.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:38:09 -05:00
Vlad Zolotarov
3bf12e4ffc service/storage_proxy: regroup collectd statistics
Instead of putting all statistics under the same "storage_proxy" category
separate them into 2 groups according to where the corresponding counters
are updated:
   - "storage_proxy_replica"
   - "storage_proxy_coordinator"

Fixes #1763

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:27:47 -05:00
Glauber Costa
99a5a77234 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, especially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
could be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will illustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
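
The interleaving can be reproduced in a small synchronous model. This is a sketch, not the commitlog code: segment_pool_sketch, buggy_order and fixed_order are hypothetical names, and the deferring point is simulated by simply ordering the statements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model: ids are handed out at allocation time.
struct segment_pool_sketch {
    uint64_t next_id = 1;
    std::deque<uint64_t> reserve;

    uint64_t allocate_segment() { return next_id++; }

    uint64_t take_from_reserve() {
        assert(!reserve.empty());
        uint64_t id = reserve.front();
        reserve.pop_front();
        return id;
    }
};

// Old behaviour: direct allocation races with the reserve refill.
std::vector<uint64_t> buggy_order() {
    segment_pool_sketch m;
    std::vector<uint64_t> used;
    uint64_t from_timer = m.allocate_segment();  // on_timer: id taken, then defers
    // new_segment(): the reserve is still empty, so allocate directly.
    used.push_back(m.reserve.empty() ? m.allocate_segment()
                                     : m.take_from_reserve());
    m.reserve.push_back(from_timer);             // on_timer resumes
    used.push_back(m.take_from_reserve());       // next new_segment()
    return used;
}

// New behaviour: a single filler feeds the reserve, consumers only pop.
std::vector<uint64_t> fixed_order() {
    segment_pool_sketch m;
    std::vector<uint64_t> used;
    m.reserve.push_back(m.allocate_segment());
    m.reserve.push_back(m.allocate_segment());
    used.push_back(m.take_from_reserve());
    used.push_back(m.take_from_reserve());
    return used;
}
```

In the buggy interleaving the segments are used in the order 2, 1; drawing every segment from the one FIFO reserve yields 1, 2.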

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>
2016-12-01 13:20:46 +01:00
Tomasz Grabiec
570fc0008b scylla-gdb: Fix lookup of symbols in 'scylla ptr'
Message-Id: <1480529617-26564-1-git-send-email-tgrabiec@scylladb.com>
2016-12-01 12:33:29 +02:00
Raphael S. Carvalho
b30a2cb21a lcs: generate info that preserves token distribution in higher levels
The information (last compacted keys) is lost after node is restarted
or schema is updated, which causes strategy to be rebuilt.
We need it for strategy to guarantee uniform distribution of token
range across sstables, or we could end up with 1 sstable of level L
overlapping with lots of sstables of level L+1, and that results in
a compaction of undesired length.
That information can be generated from scratch by getting last key
of newest sstable in each level > 0.
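
A sketch of that reconstruction, assuming each sstable exposes its level, a write time and its last key. sstable_info and rebuild_last_compacted_keys are hypothetical names, and keys are modelled as strings.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical summary of the sstable fields the reconstruction needs.
struct sstable_info {
    int level;
    int64_t write_time;    // larger means newer
    std::string last_key;
};

// For every level > 0, pick the last key of the newest sstable in that level.
std::map<int, std::string>
rebuild_last_compacted_keys(const std::vector<sstable_info>& sstables) {
    std::map<int, const sstable_info*> newest;
    for (const auto& s : sstables) {
        if (s.level <= 0) {
            continue;  // level 0 has no last compacted key
        }
        auto it = newest.find(s.level);
        if (it == newest.end() || s.write_time > it->second->write_time) {
            newest[s.level] = &s;
        }
    }
    std::map<int, std::string> keys;
    for (const auto& entry : newest) {
        keys[entry.first] = entry.second->last_key;
    }
    return keys;
}
```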

Fixes #1906.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:58 +02:00
Raphael S. Carvalho
38743c1948 sstables: provide write time of data component
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <59686148149f2159990329775e0cd8780bc54254.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:57 +02:00
Glauber Costa
d7256e7b21 database: do not call seal directly from the streaming timer
Streaming memtables have a delayed mode where many flushes are coalesced
together into one, with the actual flush happening later and propagated
to all the previous waiters.

However, the timer that triggers the actual flush was not using the
newly introduced flush infrastructure. This was a minor problem because
those flushes wouldn't try to take the semaphore, and so we could have
many flushes going on at the same time.

What was a potential performance issue became a correctness issue when
we moved the reversal of the dirty memory accounting out of
revert_potentially_cleaned_up_memory() into remove_from_flush_manager().

Since the latter is only called through the flush infrastructure, it
simply wasn't called. So the deferral of the reversal exposed this bug.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>
2016-11-30 18:00:55 +01:00
Tomasz Grabiec
c35e18ba12 tests: Fix use-after-free on commitlog
Only shutdown() ensures all internal processes are complete. Call it before calling clear().

Message-Id: <1480495534-2253-1-git-send-email-tgrabiec@scylladb.com>
2016-11-30 11:03:26 +02:00
Avi Kivity
281b4c64ea Update ami submodule
* dist/ami/files/scylla-ami 25e101f...d5a4397 (1):
  > scylla_install_ami: allow specify different repository for Scylla installation and receive update
2016-11-29 19:26:49 +02:00
Takuya ASADA
17ef5e638e dist/ami: allow specify different repository for Scylla installation and receive update
This fix splits build_ami.sh --repo into three different options:
 --repo-for-install is for Scylla package installation, only valid
 during AMI construction.

 --repo-for-update will be stored at /etc/yum.repos.d/scylla.repo, to
 receive update package on AMI.

 --repo is both, for installation and update.

Fixes #1872

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480438858-6007-1-git-send-email-syuu@scylladb.com>
2016-11-29 19:26:07 +02:00
Avi Kivity
5ea235e3e8 Merge "Prevent overloading memory with timed out writes" from Tomasz
"The goal of this series is to prevent unbounded memory use
in cases when requests are timing out. Write requests which timed
out may still occupy memory for a while because of local mutation
application. This memory is not accounted for and can build up.

First part of the fix changes local mutation application so that it times out
at about the same time as the request handler. Then the life
time of the request handler is extended to cover any background activity
of that request which hasn't timed out yet. This has two main effects:

  (1) by timing out local writes we prevent build up of background activity
      for timed out requests

  (2) we ensure that memory used by background activity is not left
      behind unaccounted for. This will prevent CQL server from admitting
      more requests than memory usage limit allows.

Fixes #1756."

* tag 'tgrabiec/prevent-oom-on-timeouts-v5' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: Do not flood logs with timeout errors
  database: Add counter for timed out writes
  storage_proxy: Delay timeout response until background work ceases
  storage_proxy: Propagate timeout to local writes
  storage_proxy: Use shared ownership for abstract_write_response_handler
  storage_proxy: Add counter for all alive write handlers
  db: Allow writes to be timed out
  db: Introduce counters for failed reads and writes
  commitlog: Allow allocations to be timed out
  utils/logalloc: Add ability to timeout run_when_memory_available() task
  utils/flush_queue: Add ability to wait with a timeout
2016-11-29 18:55:52 +02:00
Avi Kivity
28a5ff51cb dist: add build dependency on systemtap-sdt
Needed by newer seastar.
2016-11-29 18:49:51 +02:00
Tomasz Grabiec
48bbd6733c storage_proxy: Do not flood logs with timeout errors
Timeout errors are flooding the log after local mutate can time
out. We don't log remote mutate timeouts, so for consistency we won't
log local ones as well.

There is a database counter for timed out writes which can be
consulted in order to check if they're occurring.

Perhaps this would be better solved by a generic log message
throttling/coalescing mechanism, but that's not ready yet.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
b5d5612f98 database: Add counter for timed out writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
14cb31f69a storage_proxy: Delay timeout response until background work ceases
Write requests which timed out may still occupy memory for a while due
to local write. It should time out soon as well but there is a time
window in which it has not yet. If we don't delay timeout response,
the request would be seen as not consuming any memory too early. This
in turn would cause the CQL server to allow more requests than we
want, in some cases causing OOM or exceeding memory limits and causing
excessive cache eviction.

Fixes #1756.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
ba3779802f storage_proxy: Propagate timeout to local writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
6d195a1538 storage_proxy: Use shared ownership for abstract_write_response_handler 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
5805330d98 storage_proxy: Add counter for all alive write handlers
Currently the counter uses _response_handlers.size(), but after later
patches we may have an active (timed out) write with no response
handler, so count live instances instead.
2016-11-29 16:40:58 +01:00
Tomasz Grabiec
2c561ecaed db: Allow writes to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
b1ae6ad2ad db: Introduce counters for failed reads and writes 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
31645e2c4a commitlog: Allow allocations to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
e14caaef60 utils/logalloc: Add ability to timeout run_when_memory_available() task 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
61d81617e1 utils/flush_queue: Add ability to wait with a timeout 2016-11-29 16:40:58 +01:00
Raphael S. Carvalho
a16425833c size_tiered: do not recreate bucket when it goes beyond max threshold
Problem will cause size tiered to return small jobs when there are
more than max_threshold sstables of similar size. For example, if
max_threshold is 32, and there are 36 sstables of similar size,
strategy will only return 4 sstables to be compacted. That's because
we incorrectly create a new bucket when it meets the max threshold.
What we should do is to allow buckets to grow beyond max threshold
and trim them when selecting the most suitable one for compaction.
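
The trimming step might look like the sketch below. Sstables are represented only by their sizes, trim_bucket is a hypothetical name, and keeping the smallest sstables when trimming is an assumption; the message does not say which ones are kept.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Let the bucket grow freely; cap the job size only when selecting it.
std::vector<uint64_t> trim_bucket(std::vector<uint64_t> bucket,
                                  std::size_t max_threshold) {
    // Assumption: prefer the smallest sstables of the bucket.
    std::sort(bucket.begin(), bucket.end());
    if (bucket.size() > max_threshold) {
        bucket.resize(max_threshold);
    }
    return bucket;
}
```

For the example above, a bucket of 36 similar-sized sstables with max_threshold 32 yields a job of 32 sstables rather than 4.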

Important to mention that estimation for size tiered will now
work better when there are more than max_threshold sstables of
similar size.

Fixes #1901.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>
2016-11-29 16:56:02 +02:00
Glauber Costa
353a4cd2d4 commitlog: sync segments before acquiring semaphore on shutdown.
Sync all segments before acquiring the semaphore, otherwise waiting may
have to wait for the timer to kick in and push them down.
Note that we can't guarantee that no other requests were executed in the
mean time, so we have to sync again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>
2016-11-29 11:07:28 +02:00
Tomasz Grabiec
96c7764458 Revert "prevent commitlog replay position reordering during reserve refill"
This reverts commit 0e9b75d406.

commitlog_test fails with this:

Running 14 test cases...
ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed
ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required)

*** 1 failure detected in test suite "tests/commitlog_test.cc"
WARN  2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
2016-11-28 20:52:13 +01:00
Raphael S. Carvalho
f141b0cdae database: atomically add new sstables to cf when refreshing
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.

Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.

Fixes #1368.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
2016-11-28 20:30:48 +02:00
Glauber Costa
0e9b75d406 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, especially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
could be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will illustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
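
A minimal sketch of the single-pool idea, not the actual commitlog code:
segment, segment_reserve, refill() and take() are hypothetical names, and
the real filler runs asynchronously in the background. Because ids are
assigned only in refill() and segments are consumed in FIFO order, a
caller can never receive an id lower than one already handed out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a commitlog segment; only the id matters here.
struct segment {
    std::uint64_t id;
};

// Every segment, including the ones needed when the reserve runs dry,
// is created by refill() and consumed in FIFO order by take().
class segment_reserve {
    std::uint64_t _next_id = 1;
    std::deque<segment> _segments;
public:
    // The filler: the only place where segment ids are assigned.
    void refill(std::size_t target) {
        while (_segments.size() < target) {
            _segments.push_back(segment{_next_id++});
        }
    }

    // Callers never allocate directly. The real code would wait for the
    // background filler here; this synchronous sketch runs it inline.
    segment take() {
        if (_segments.empty()) {
            refill(1);
        }
        assert(!_segments.empty());
        segment s = _segments.front();
        _segments.pop_front();
        return s;
    }
};
```

Calling take() repeatedly, with or without an earlier refill(), yields
strictly increasing ids.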

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>
2016-11-28 19:26:26 +01:00
Takuya ASADA
1042e40188 dist/common/scripts/scylla_kernel_check: fix incorrect document URL
Fixes #1871

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480327243-18177-1-git-send-email-syuu@scylladb.com>
2016-11-28 13:51:19 +02:00
Avi Kivity
18df2d9e9e partition_version: fix const correctness in rows_entry_compare
Using a non-const-correct comparator results in build failures with
boost 1.55.

Fixes #1892.
Message-Id: <20161128104335.28789-1-avi@scylladb.com>
2016-11-28 10:55:12 +00:00
Avi Kivity
5358984982 Merge seastar upstream
* seastar 93c3b12...7790e68 (7):
  > core/reactor: Introduce reactor-*/dervie-busy_ns metric
  > Collectd: Hold a reference to the metrics implementation in registration
  > future: Improve comments
  > fstream: actually use dynamically adjusted buffer
  > debug: add latency detector script
  > reactor: add static probes for latency detector
  > semaphore: Fix with_semaphore() in case wait() throws
2016-11-28 11:05:59 +02:00
Avi Kivity
28857e42e7 Merge " Virtualize size_estimates system table" from Duarte
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.

This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.

Fixes #1616"

* 'virtual-estimates/v4' of github.com:duarten/scylla:
  size_estimates_virtual_reader: Add unit test
  db: Delete size_estimates_recorder
  size_estimates: Add virtual reader
  column_family: Add support for virtual readers
  storage_service: get_local_tokens() returns a future
  nonwrapping_range: Add slice() function
  range: Find a sequence's lower and upper bounds
  system_keyspace: Build mutations for size estimates
  size_estimates: Store the token range as bytes
  range_estimates: Add schema
  murmur3_partitioner: Convert maximum_token to sstring
2016-11-28 10:12:59 +02:00
Avi Kivity
176fca5775 logalloc: use correct header for unique_ptr
<bits/unique_ptr.hh> is a libstdc++ internal header.  Use <memory> instead.
2016-11-27 23:08:04 +02:00
Glauber Costa
c32803f2f0 database: move reversion of virtual dirty state closer to update_cache.
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.

During that time, we may not accept new requests. Sealing can take a
long time, especially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.

This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
2016-11-24 18:18:15 +01:00
Raphael S. Carvalho
4781b6eb71 sstables: use nonwrapping_range::make to avoid compilation issues
GCC 5.3.1 was unable to convert bound to optional<bound>.

sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));

In file included from ./dht/i_partitioner.hh:52:0,
                 from ./query-request.hh:28,
                 from ./clustering_key_filter.hh:27,
                 from sstables/sstables.hh:35,
                 from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
     explicit nonwrapping_range(const wrapping_range<T>& r)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
2016-11-24 11:26:16 +02:00
Duarte Nunes
cc3f26c993 lz4: Conditionally use LZ4_compress_default()
Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.

This is especially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the deprecated warning.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
2016-11-24 11:25:03 +02:00
Avi Kivity
1be95b1227 Merge seastar upstream
* seastar d6f26d8...93c3b12 (3):
  > rpc: Conditionally use LZ4_compress_default()
  > queue: allow queue to change its maximum size
  > util/defer: add missing return to move assignment
2016-11-24 11:00:53 +02:00
Duarte Nunes
a527ba285f thrift: Don't apply cell limit across rows
In Thrift, SliceRange defines a count that limits the number of cells
to return from that row (in CQL3 terms, it limits the number of rows
in that partition). While this limit is honored in the engine, the
Thrift layer also applies the same limit, which, while redundant in
most cases, is used to support the get_paged_slice verb.

Currently, the limit is not being reset per Thrift row (CQL3
partition), so in practice, instead of limiting the cells in a row,
we're limiting the rows we return as well. This patch fixes that by
ensuring the limit applies only within a row/partition.
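
A minimal sketch of the intended behavior, assuming a hypothetical
limit_cells_per_row() helper and plain ints standing in for cells; it is
not the Thrift handler code. The point is that the cell budget starts
fresh for every row instead of being shared across rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using row = std::vector<int>;  // cells of one Thrift row (CQL3 partition)

// Returns at most cell_limit cells from each row. Every row is kept,
// because the limit is applied per row and not across the whole result.
std::vector<row> limit_cells_per_row(const std::vector<row>& rows,
                                     std::size_t cell_limit) {
    std::vector<row> result;
    result.reserve(rows.size());
    for (const row& r : rows) {
        // The budget is reset here, once per row.
        std::size_t n = std::min(cell_limit, r.size());
        result.emplace_back(r.begin(), r.begin() + n);
    }
    return result;
}
```

With a limit of 2, rows of 3, 3 and 1 cells come back as rows of 2, 2
and 1 cells; no row is dropped.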

Fixes #1882

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161123220001.15496-1-duarte@scylladb.com>
2016-11-24 10:38:31 +02:00
Takuya ASADA
ce80fb3a39 dist/ubuntu: increase number of open files on Ubuntu 14.04(upstart)
Follow the change of NOFILE for non-systemd environment.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479975050-14907-1-git-send-email-syuu@scylladb.com>
2016-11-24 10:13:41 +02:00
Avi Kivity
d58c8aaa32 db: remove unused belongs_to_{current,other}_shard(s) functions
Obsoleted by new sharding mechanism, but break the build for some.
2016-11-23 21:39:29 +02:00
Avi Kivity
b81a57e8eb config, dht: reduce default msb ignore bits to 4
With the default value of 12, a node's range is partitioned into
4096 * smp::count sub-ranges which are queried sequentially for a range
scan.  If the number of rows in the table is smaller than the required
result size, we will query all of them.  This can take so long that we
time out.

A better fix is to query multiple sub-ranges in parallel and merge them,
but for that we need to resurrect the non-sequential merger.
2016-11-23 21:25:37 +02:00
Pekka Enberg
c526a9f0be Update seastar submodule
* seastar 7473945...d6f26d8 (2):
  > semaphore_units: add missing return statement
  > metrics: Do not destroy the metrics layer if it is being used
2016-11-23 20:27:09 +02:00
Paweł Dziepak
919825a2c7 Merge "Improve sharding in large clusters" from Avi
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively.  Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.

This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes.  Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.

In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.

Fixes #1277."
2016-11-23 11:25:53 +00:00
Glauber Costa
18b9fa3d43 dist: increase number of open files
This limit was found to be too low for production environments. It would
be hit at boot, when we're touching a lot of files from multiple shards
before deciding that we don't need them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>
2016-11-23 13:10:25 +02:00
Avi Kivity
07d5a20bae Wire up sharding ignore msb parameter to configuration
We might have used a fancy map<sstring, any> to pass the parameters, but
that's overkill for now.
2016-11-22 22:40:47 +02:00
Avi Kivity
8b1d689de8 partitioner: add ignore_msb parameters to byte ordered and random partitioners
Ignored; doesn't make sense on byte ordered, and random is deprecated.
2016-11-22 21:56:42 +02:00
Avi Kivity
af16c0fac4 murmur3_partitioner: shard on the middle token bits, not most significant bits
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.

This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's shard, so we use token bits which still have
some entropy.  In effect, this changes the token range layout from

   shard 0
   shard 1
   ...
   shard S-1

to

   shard 0
   shard 1
   ...
   shard S-1

   shard 0
   shard 1
   ...
   shard S-1

   ...

   shard 0
   shard 1
   ...
   shard S-1

Where the number of repetitions of the block is 2^(ignored msb bits).

For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
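
A minimal sketch of the mapping described above, assuming the token has
already been converted to an unsigned 64-bit value; shard_of() and its
parameter names are illustrative, and unsigned __int128 is a GCC/Clang
extension. Shifting out the ignored bits makes the shard pattern repeat
2^(ignored msb bits) times across the token space.

```cpp
#include <cassert>
#include <cstdint>

// Maps a token to a shard in [0, shards), ignoring the top
// ignore_msb_bits bits of the token.
inline unsigned shard_of(std::uint64_t token, unsigned shards,
                         unsigned ignore_msb_bits) {
    assert(shards > 0 && ignore_msb_bits < 64);
    // Discard the bits that vnodes already use to pick the owning node.
    token <<= ignore_msb_bits;
    // Scale the remaining bits into [0, shards).
    return static_cast<unsigned>(
        (static_cast<unsigned __int128>(token) * shards) >> 64);
}
```

With zero ignored bits a token in the middle of the ring maps to the
middle shard; with one ignored bit the same token starts the second
repetition of the block and maps to shard 0 again.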
2016-11-22 21:56:42 +02:00
Avi Kivity
024c8ef8a1 db: adjust sstable load to use sstable self-reporting of shard ownership
Instead of calculating the owning shard from the sstable's partition
key range, delegate to the new sstable method for getting owning shard
information.  This insulates us from changes in the sharding algorithm.
2016-11-22 21:56:40 +02:00
Avi Kivity
98a4544e1c sstables: add method to get sstable owning shards from an unloaded sstable
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do.  Add a method that allows us
to read just the sharding data, without opening anything else.
2016-11-22 21:52:23 +02:00
Avi Kivity
bdd11648ac sstables: add intra-node sharding metadata
Add a metadata component that describes token ranges that are spanned by
this sstable.  With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describe sharding information, but for multi-range algorithms, this
is not sufficient.
2016-11-22 21:44:25 +02:00
Avi Kivity
316ef1d70a sstables: automate writing statistics components
Add a virtual function to metadata_base so we can loop over statistics
components when writing them.
2016-11-22 21:05:06 +02:00
Glauber Costa
13973e7f3b keep background work semaphore alive during sstable flush
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we start the flush and not when we finish it.

That's not the intended behavior and that can cause flushes to
accumulate.
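
A minimal sketch of the lifetime issue in plain C++, without seastar:
units and start_flush() are hypothetical names, and an int counter
stands in for the semaphore. Moving the permit into the continuation
keeps it held until the post-flush work runs, instead of releasing it
when the function that started the flush returns.

```cpp
#include <cassert>
#include <utility>

// Hypothetical RAII permit: takes one unit on construction and returns
// it on destruction, unless it has been moved from.
class units {
    int* _available;
public:
    explicit units(int& available) : _available(&available) {
        assert(*_available > 0);
        --*_available;
    }
    units(units&& other) noexcept
        : _available(std::exchange(other._available, nullptr)) {}
    units(const units&) = delete;
    units& operator=(const units&) = delete;
    units& operator=(units&&) = delete;
    ~units() {
        if (_available) {
            ++*_available;
        }
    }
};

// Returns the "post-flush continuation". The permit is moved into it, so
// it is released when the continuation runs (or is destroyed), not when
// start_flush() returns.
inline auto start_flush(int& available) {
    units permit(available);
    return [permit = std::move(permit)]() mutable {
        units released_at_end = std::move(permit);
        // ... post-flush work would go here ...
    };
}
```

Between start_flush() and the call to the returned continuation the
counter stays decremented, which is what throttles concurrent flushes.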

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
2016-11-22 19:54:08 +01:00
Avi Kivity
d05b22e502 sstables: automatically calculate offsets in statistics
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
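
A minimal sketch of the accumulation loop, assuming a hypothetical
component_offsets() helper that receives the serialized size of each
component in order; the actual Statistics layout is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// first_offset is where the first component starts (after whatever
// header precedes the components); sizes holds each component's
// serialized size, in write order.
std::vector<std::uint64_t> component_offsets(
        std::uint64_t first_offset,
        const std::vector<std::uint64_t>& sizes) {
    std::vector<std::uint64_t> offsets;
    offsets.reserve(sizes.size());
    std::uint64_t offset = first_offset;
    for (std::uint64_t size : sizes) {
        offsets.push_back(offset);  // this component starts here
        offset += size;             // the next one starts right after it
    }
    return offsets;
}
```

Adding a new component then only requires adding its size to the list;
no offset has to be computed by hand.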
2016-11-22 20:35:24 +02:00
Avi Kivity
7c5e6525ef sstables: switch statistics components to generic serialized_size() implementation
2016-11-22 20:20:38 +02:00
Avi Kivity
096ae59a5b sstables: introduce generic serialized_size()
Introduce a new function that reuses the file_writer code to compute
the serialized size of an sstable object, by serializing it into memory
and discarding the result.
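
A minimal sketch of the idea of reusing the write path to compute a
size. All names here (counting_writer and the write() overloads) are
hypothetical, and unlike the approach described above this variant
counts bytes instead of buffering them in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sink that discards the data and only counts bytes.
class counting_writer {
    std::size_t _size = 0;
public:
    void write(const char*, std::size_t n) { _size += n; }
    std::size_t size() const { return _size; }
};

// Writers for two example types, generic over the sink.
template <typename Writer>
void write(Writer& out, std::uint32_t v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof(v));
}

template <typename Writer>
void write(Writer& out, const std::string& s) {
    // Length prefix followed by the bytes.
    write(out, static_cast<std::uint32_t>(s.size()));
    out.write(s.data(), s.size());
}

// The size is whatever the regular write path would have produced.
template <typename T>
std::size_t serialized_size(const T& object) {
    counting_writer w;
    write(w, object);
    return w.size();
}
```

The benefit of this design is that the size computation cannot drift
from the serialization code, because it is the same code.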
2016-11-22 20:06:23 +02:00
Avi Kivity
3c06ffac9d sstables: const correctness for the write(file_writer&, T&) functions
write() doesn't need to change its input; so change it to const.

The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.

This helps in writing a generic serialized_size() that is const correct,
in the next patch.
2016-11-22 20:04:27 +02:00
Tomasz Grabiec
eefc538225 Update seastar submodule
* seastar 7504026...7473945 (1):
  > Merge "Improve support for timeouts in primitives"
2016-11-22 17:51:29 +01:00
Glauber Costa
0b8b5abf16 commitlog: acquire semaphore earlier
Recently we have changed our shutdown strategy to wait for the
_request_controller semaphore to make sure no other allocations are
in-flight. That was done to fix an actual issue.

The problem is that this wasn't done early enough. We acquire the
semaphore after we have already marked ourselves as _shutdown and
released the timer.

That means that if there is an allocation in flight that needs to use a
new segment, it will never finish - and we'll therefore never acquire
the semaphore.

Fix it by acquiring it first. At this point the allocations will all be
done and gone, and then we can shutdown everything else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>
2016-11-21 22:19:32 +00:00
Avi Kivity
6bdb8ba31d storage_proxy: don't query concurrently needlessly during range queries
storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be).  However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.

Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.

Fixes #1863.

Message-Id: <20161120165741.2488-1-avi@scylladb.com>
2016-11-21 18:19:46 +02:00
Glauber Costa
0ca8c3f162 database: keep a pointer to the memtable list in a memtable
We currently pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.

Doing that also has a couple of advantages. Mainly, during flush we must
get from a memtable to its memtable_list. Currently we do that by going
from the memtable to a column family through the schema, and from there to
the memtable_list.

That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep a
pointer, and when needed get the memtable_list directly.

Not only does that get rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this traversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.

At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.

The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.

Fixes #1870

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
2016-11-21 18:18:27 +02:00
Duarte Nunes
def2bc72b0 size_estimates_virtual_reader: Add unit test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
6a37d87c76 db: Delete size_estimates_recorder
Now that access to the size_estimates system is virtualized, we no
longer need the recorder.

Fixes #1616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
225648780d size_estimates: Add virtual reader
This patch adds a virtual mutation_reader so that queries
to the size_estimates system table are handled by the engine
without needing to perform any IO.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
cd7e2fd602 column_family: Add support for virtual readers
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
c0d450c57d storage_service: get_local_tokens() returns a future
This patch changes the get_local_tokens() function in storage_service
to return a future instead of requiring running under a seastar::thread.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
9b384d375f nonwrapping_range: Add slice() function
This patch adds the slice() function to nonwrapping_range, which uses
its bounds to slice an input sequence.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
bdba8d99c3 range: Find a sequence's lower and upper bounds
This patch extracts a pair of functions from mutation_partition to
calculate the lower and upper bounds of a sequence from a
nonwrapping_range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
636287fdf2 system_keyspace: Build mutations for size estimates
This patch adds a function to system_keyspace responsible for creating
a mutation to a partition of the size_estimates system table from a
set of range_estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
18ddec245e size_estimates: Store the token range as bytes
This patch changes the range_estimates struct so that the tokens are
represented as utf8 encoded bytes. This will make future patches
require less conversions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:14:21 +00:00
Duarte Nunes
e7a5162c1d range_estimates: Add schema
This will be used in future patches, when virtualizing the
size_estimates system table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Duarte Nunes
01815ecd24 murmur3_partitioner: Convert maximum_token to sstring
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
minimum_token for the maximum_token.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Takuya ASADA
eee63027e5 dist/ami/build_ami.sh: update base AMI to CentOS7-Base5
To drop unnecessary .ssh/authorized_keys, we need to update base AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479496938-29724-1-git-send-email-syuu@scylladb.com>
2016-11-21 10:12:47 +02:00
Avi Kivity
783729c540 Merge "Clean up T::memory_usage() function" from Paweł
"This series is just a cleanup whose intention is to deal with all the
confusion related to the way T::memory_usage() functions work.

* T::memory_usage() which returned external memory usage are renamed
  to T::external_memory_usage()
* T::memory_usage() is introduced where needed to avoid repeating
  sizeof(T) + T::external_memory_usage()"

Paweł Dziepak (6):
  rename memory_usage() to external_memory_usage() where applicable
  streamed_mutation: add memory_usage() to mutation fragment types
  keys: add memory_usage()
  partition_snapshot_accounter: use range_tombstone::memory_usage()
  mutation_rebuilder: use memory_usage()
  frozen_mutation: use memory_usage()
2016-11-21 10:11:39 +02:00
Avi Kivity
498887ca0d Merge seastar upstream
* seastar 31c5fd7...7504026 (2):
  > circular_buffer: add move assignment operator
  > scollectd: Fix serialization of GAUGE-typed values
2016-11-20 20:16:56 +02:00
Gleb Natapov
9222a47fed sstable test: add test for generated summary data
Message-Id: <20161117155051.GV6765@scylladb.com>
2016-11-20 19:50:45 +02:00
Glauber Costa
21c1e2b48c commitlog: wait for pending allocations to finish before closing gate.
allocations may enter the gate, so it would be wise for us to wait for them.

Fixes #1860

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com>
2016-11-20 19:45:33 +02:00
Avi Kivity
a39b92a40a build: fix tests-with-symbols generation
Bad indentation caused the libs variable for tests-with-symbols to be
overwritten, resulting in link failure.
2016-11-20 17:23:26 +02:00
Glauber Costa
504b5ac30f database: don't check for waiters in the condition variable predicate.
In the last iterations of this patchset, we have moved explicit flushes
to acquire the semaphore directly and the coalescing inside the
memtable_list. As a result, we are no longer keeping any kind of action
for them inside the condition variable. Checking for them no longer has
a purpose.

This is a cleanup patch that removes those checks.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>
2016-11-18 21:34:48 +01:00
Glauber Costa
1933349654 database: fix direct flushes of non-durable column families.
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.

We will remove the permits ourselves if there is an exception, but not
under normal circumstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this test will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
2016-11-18 21:32:29 +01:00
Avi Kivity
6eecbc80dc CONTRIBUTING.md: add sections for help and issues
Don't scare away users reporting an issue with the CLA.
2016-11-18 22:21:10 +02:00
Glauber Costa
60b7d35f15 commitlog: close file after read, and not at stop
There are other code paths that may interrupt the read in the middle
and bypass stop. It's safer this way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>
2016-11-18 14:09:33 +00:00
Paweł Dziepak
249e0ab087 frozen_mutation: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
948c062e64 mutation_rebuilder: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
e04664e851 partition_snapshot_accounter: use range_tombstone::memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
711bd19f16 keys: add memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
6b8bf030c0 streamed_mutation: add memory_usage() to mutation fragment types
This patch introduces memory_usage() to static_row, clustering_row and
range_tombstone so that we can avoid repeating sizeof(T) +
x.external_memory_usage().

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
ef57b9a26f rename memory_usage() to external_memory_usage() where applicable
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Avi Kivity
fec4ef3390 Merge "Make sure commitlog replay is able to make progress" from Glauber
"Fixes #1856

Commitlog replay reads are being issued without a priority. That means
they will lose to compaction every time."

* 'issue-1856-v2' of github.com:glommer/scylla:
  commitlog: use read ahead for replay requests
  commitlog: use commitlog priority for replay
  commitlog: close replay file
2016-11-18 12:04:18 +02:00
Takuya ASADA
55e5123313 dist/redhat: Support RHEL7
We supported installing the CentOS7 .rpm on RHEL7, but we haven't supported
building on RHEL7, since there is a little difference from CentOS,
and that causes a build error.

This patch fixes the error, now we can produce .rpm for RHEL7 without
using CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479431134-8032-1-git-send-email-syuu@scylladb.com>
2016-11-18 11:56:05 +02:00
Glauber Costa
461778918b fix shutdown and exception conditions for flush logic
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
  the explicit flushes no longer wait on the condition variable. So we
  don't.
- We now wait on the stop() flushes (regardless of their return status)
  so we can make sure that the _flush_queue will indeed be done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
  to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
  handler

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
2016-11-17 21:16:44 +01:00
Glauber Costa
59a41cf7f1 commitlog: use read ahead for replay requests
Aside from putting the requests in the commitlog class, read ahead
will help us go through the file faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:54 -05:00
Glauber Costa
aa375cd33d commitlog: use commitlog priority for replay
Right now replay is being issued with the standard seastar priority.
The rationale for that at the time was that it is an early event that
doesn't really share the disk with anybody.

That is largely untrue now that we start compactions on boot.
Compactions may fight for bandwidth with the commitlog, and with such
low priority the commitlog is guaranteed to lose.

Fixes #1856

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:02 -05:00
Glauber Costa
4d3d774757 commitlog: close replay file
Replay file is opened, so it should be closed. We're not seeing any
problems arising from this, but they may happen. Enabling read ahead in
this stream makes them happen immediately. Fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 12:35:24 -05:00
Avi Kivity
eaf83ab59c Merge seastar upstream
* seastar 3001c08...31c5fd7 (2):
  > Safe use of collectd during shutdown
  > udp: abort reader and writer when udp channel close
2016-11-17 18:44:28 +02:00
Piotr Jastrzebski
9d33948487 mutation_rebuilder: fix fragment size calculation
It wasn't calculating the size of data correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c03dfff7bf1ca3199991e5864189f98bfa2942ea.1479397736.git.piotr@scylladb.com>
2016-11-17 16:23:42 +00:00
Raphael S. Carvalho
3dc9294023 db: do not leak deleted sstable when deletion triggers an exception
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.

The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.

We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.

Fixes #1840.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
2016-11-17 17:46:36 +02:00
Gleb Natapov
c052a1bc4f sstable: use schema's min_index_interval config when generating missing summary
Message-Id: <20161116181937.GA25303@scylladb.com>
2016-11-17 15:24:03 +02:00
Avi Kivity
5d067eebf2 Merge "get rid of memtable size parameter and rework flush logic" from Glauber
"This patchset allows Scylla to determine the size of a memtable instead
of relying on the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.

Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
last submission, except this time I have run some experiments to gauge
behavior of that versus 1/2 of dirty memory, which was a preferred
theoretical value.

After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allows us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.

I have run mainly two workloads with this. The first one a local RF=1
workload with large partitions, sized 128kB and 100 threads. The results
are:

Before:

op rate                   : 632 [WRITE:632]
partition rate            : 632 [WRITE:632]
row rate                  : 632 [WRITE:632]
latency mean              : 157.8 [WRITE:157.8]
latency median            : 115.5 [WRITE:115.5]
latency 95th percentile   : 486.7 [WRITE:486.7]
latency 99th percentile   : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max               : 722.6 [WRITE:722.6]
Total partitions          : 189667 [WRITE:189667]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

After:

op rate                   : 951 [WRITE:951]
partition rate            : 951 [WRITE:951]
row rate                  : 951 [WRITE:951]
latency mean              : 104.8 [WRITE:104.8]
latency median            : 102.5 [WRITE:102.5]
latency 95th percentile   : 155.8 [WRITE:155.8]
latency 99th percentile   : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max               : 1081.4 [WRITE:1081.4]
Total partitions          : 285324 [WRITE:285324]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

The other workload was the workload described in #1817. And the result
is that we now have a load that is very stable around 100k ops/s and
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s and lots of timeouts, or the deep reduction of
1.5-rc1."

* 'issue-1817-v4' of github.com:glommer/scylla:
  database: rework memtable flush logic
  get rid of max_memtable_size
  pass a region to dirty_memory_manager accounting API
  memtable: add a method to expose the region_group
  logalloc: allow region group reclaimer to specify a soft limit
  database: remove outdated comment
  database: uphold virtual dirty for system tables.
2016-11-17 14:36:43 +02:00
Avi Kivity
18078bea9b storage_proxy: avoid calculating digest when only one replica is contacted
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all.  The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).

This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.

In the future we may add other values for more hardware-friendly digest
algorithms.
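
A minimal sketch of the selection rule described above. The enum values
(none, MD5) come from this message; the helper name and its parameter are
hypothetical and only illustrate the decision, not the actual storage_proxy
code.

```cpp
#include <cstddef>

// Values named in the commit message: none and MD5 (the default).
enum class digest_algorithm { none, MD5 };

// Hypothetical helper: a digest is only useful when results from
// several replicas have to be compared with each other.
inline digest_algorithm choose_digest_algorithm(std::size_t replicas_contacted) {
    return replicas_contacted <= 1 ? digest_algorithm::none
                                   : digest_algorithm::MD5;
}
```

With a single replica there is nothing to compare the digest against, so
the sketch skips it.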
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
2016-11-17 13:04:30 +02:00
Asias He
dc50ce0ce5 streaming: Make the mutation readers when streaming starts
Currently we make the mutation readers for streaming at different
points in time, i.e.,

do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) {
     make a mutation reader for this range
     read mutations from the reader and send
})

If there is a write workload in the background, we will stream extra
data, since the later the reader is made the more data we need to send.

Fix it by making all the readers before starting to stream.

Fixes #1815
Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>
2016-11-17 12:41:53 +02:00
Gleb Natapov
ae0a2935b4 sstables: fix ad-hoc summary creation
If the sstable Summary is not present, Scylla does not refuse to boot but
instead creates summary information on the fly. There is a bug in this
code though. The Summary file is a map between keys and offsets into the Index
file, but the code creates map between keys and Data file offsets
instead. Fix it by keeping offset of an index entry in index_entry
structure and use it during Summary file creation.
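
A simplified sketch of the idea, assuming the summary samples every
min_index_interval-th index entry. All struct and function names here are
hypothetical stand-ins, not the real sstables types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an Index file entry.
struct index_entry {
    std::string key;
    std::uint64_t data_offset;   // position of the partition in the Data file
    std::uint64_t index_offset;  // position of this entry in the Index file
};

// Hypothetical stand-in for a Summary entry: key -> Index file offset.
struct summary_entry {
    std::string key;
    std::uint64_t index_offset;
};

// Sample every min_index_interval-th index entry (an assumption of this
// sketch) and record where it lives in the Index file, not the Data file.
inline std::vector<summary_entry> build_summary(const std::vector<index_entry>& index,
                                                std::size_t min_index_interval) {
    assert(min_index_interval > 0);
    std::vector<summary_entry> summary;
    for (std::size_t i = 0; i < index.size(); i += min_index_interval) {
        summary.push_back({index[i].key, index[i].index_offset});
    }
    return summary;
}
```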

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
2016-11-17 11:05:23 +02:00
Glauber Costa
f08162e181 database: rework memtable flush logic
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.

This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.

After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.

For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.

Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently, is that under heavy load the commitlog will keep
sealing memtables adding to the existing load.

Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges.  That is the perfect moment for us to act.
It indicates that all the data flushing part is done.

The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:58 -05:00
Glauber Costa
895e838ac0 get rid of max_memtable_size
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.

It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.

The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.

The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing down the memtable when we hit
the limit, and then throttle. When we set the threshold to 1/4 of dirty, we
don't throttle at all.

This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
2ed3f342c1 pass a region to dirty_memory_manager accounting API
We would like to know from which region is a particular flush coming
from, and account accordingly. The reasoning behind that, is that soon
we'll be driving the flushes internally from the dirty_memory_manager
without explicitly triggering them.

We need to start a flush before the current one finishes, otherwise
we'll have a period without significant disk activity when the current
SSTable is being sealed, the caches are being updated, etc.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
0b337dab14 memtable: add a method to expose the region_group
That is technically not needed because a memtable inherits from group. So
whenever we have a memtable, we can use its group() method to obtain a
group for it, and then from there go to the region_group.

However, region() is a const method in the memtable, so we have to play
tricks with const_cast, or remove the constness from the region. An
alternative to that, which I prefer, is to expose a method for the
region_group directly from the memtable object that does the right thing
and bypasses all that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
f86c9e36f4 logalloc: allow region group reclaimer to specify a soft limit
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.

However, for some applications, we would be interested in knowing about
memory build up earlier, so we can start doing something about it before
we reach that condition.

This patch introduces soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition is
abated.

There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard, threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.
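
A rough model of the two thresholds, for illustration only. The class and
method names below are hypothetical and do not reproduce the
region_group_reclaimer interface; whether "over the limit" is strict or
inclusive is also an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical two-threshold pressure tracker.
class two_level_reclaimer {
    std::size_t _soft_limit;
    std::size_t _hard_limit;
    std::size_t _used = 0;
public:
    two_level_reclaimer(std::size_t soft_limit, std::size_t hard_limit)
        : _soft_limit(soft_limit), _hard_limit(hard_limit) {
        assert(soft_limit <= hard_limit);
    }
    // Record the current memory usage.
    void update(std::size_t used) { _used = used; }
    // Pressure of any kind starts at the soft limit...
    bool over_soft_limit() const { return _used > _soft_limit; }
    // ...while the hard limit is the original threshold condition.
    bool over_hard_limit() const { return _used > _hard_limit; }
};
```

A caller can start background work once over_soft_limit() is true and
reserve stronger measures, such as throttling, for over_hard_limit().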

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
da738a6cd1 database: remove outdated comment
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
919de98aa5 database: uphold virtual dirty for system tables.
Currently the virtual dirty mechanism is not properly set for system
tables. We haven't divided the system table allowance by two, which
means it won't start throttling earlier as it was supposed to.

In practice, this has little effect because system table requests are
very well behaved, their sizes well known, and they tend to be
force-flushed. But we should be consistent.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Avi Kivity
f26c6569d2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 61ff5c6...25e101f (1):
  > scylla_install_ami: delete unneeded authorized_keys from AMI image
2016-11-16 22:36:31 +02:00
Takuya ASADA
3802f289f8 dist: remove bc from dependency
Since we replaced the shell script based cpuset generator with a python based one,
we no longer depend on the bc command.
d571123afd

So drop it from .rpm/.deb dependency.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479152876-11020-1-git-send-email-syuu@scylladb.com>
2016-11-16 15:02:55 +02:00
Amnon Heiman
a4be7afbb0 API: cache_capacity should use uint for summing
Using integer as a type for the map_reduce causes numeric overflow.
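
To illustrate the overflow, here is a sketch that uses std::accumulate as
a stand-in for map_reduce. The function name and the choice of
std::uint64_t are assumptions of this example.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// The type of the initial value decides the accumulator type, so passing
// a plain int 0 here would sum in int and could overflow.
inline std::uint64_t total_capacity(const std::vector<std::uint64_t>& per_shard) {
    return std::accumulate(per_shard.begin(), per_shard.end(), std::uint64_t{0});
}
```

Four shards of 1 GiB each already add up to 2^32 bytes, which does not fit
in a 32-bit integer.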

Fixes #1801

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1479299425-782-1-git-send-email-amnon@scylladb.com>
2016-11-16 13:55:46 +01:00
Avi Kivity
31d0e31de2 Merge seastar upstream
* seastar 47e1821...3001c08 (5):
  > core: Introduce weak_ptr<>
  > timer: Add missing include
  > tutorial: fix TeX template
  > Merge "Adding the metrics layer" from Amnon
  > core/memory: let malloc(0) return a valid pointer
2016-11-16 14:20:49 +02:00
Pekka Enberg
8a4bd6ecd5 README: Guidelines for contributing
Message-Id: <1479288359-14168-1-git-send-email-penberg@scylladb.com>
2016-11-16 12:50:02 +02:00
Paweł Dziepak
f877be50b0 Merge "Keep wide partition cache entry longer than others" from Piotr
"Cache entries for wide partitions are usually smaller than other
entries and the cost of recreating them is higher so it makes sense
to keep them longer than ordinary entries."
2016-11-15 20:44:52 +00:00
Paweł Dziepak
b8d737ff0a tests/row_cache_test: verify that eviction follows lru
Refs #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:57:54 +01:00
Paweł Dziepak
999dafbe57 row_cache: touch entries read during range queries
Fixes #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:54:11 +01:00
Tomasz Grabiec
11c5f4ab50 storage_proxy: Add counters for throttled writes 2016-11-15 17:18:25 +01:00
Piotr Jastrzebski
5ec668c9c6 Add separate LRU for wide partitions.
Evict wide partitions only every 1000 normal partition
evictions.
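
One possible shape of such a policy, as a sketch. The class is
hypothetical; in particular, evicting a wide entry when no normal entries
are left is an assumption of this example, not something stated above.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <utility>

// Hypothetical evictor with one LRU per kind of entry.
class two_lru_evictor {
    std::list<std::string> _normal;     // front is the least recently used
    std::list<std::string> _wide;
    std::size_t _wide_period;
    std::size_t _normal_evictions = 0;  // counted since the last wide eviction
public:
    explicit two_lru_evictor(std::size_t wide_period = 1000)
        : _wide_period(wide_period) {}

    void add_normal(std::string key) { _normal.push_back(std::move(key)); }
    void add_wide(std::string key) { _wide.push_back(std::move(key)); }

    // Evict one entry, preferring normal entries until enough of them
    // have been evicted to allow a wide one.
    std::optional<std::string> evict_one() {
        const bool wide_turn = !_wide.empty()
            && (_normal.empty() || _normal_evictions >= _wide_period);
        auto& lru = wide_turn ? _wide : _normal;
        if (lru.empty()) {
            return std::nullopt;
        }
        std::string victim = std::move(lru.front());
        lru.pop_front();
        _normal_evictions = wide_turn ? 0 : _normal_evictions + 1;
        return victim;
    }
};
```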

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 16:19:13 +01:00
Piotr Jastrzebski
9a41bfbf69 Add collectd metric for wide partition evictions.
This will allow us to see how many cached entries for wide
partitions are evicted.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 15:53:14 +01:00
Paweł Dziepak
055d78ee4c query_pagers: distinct queries do not have clustering keys
Query pager needs to handle results that contain partitions with
possibly multiple clustering rows quite differently than results with
just one row per partition (for example a page may end in a middle of
partition). However, the logic dealing with partitions with clustering
rows doesn't work correctly for SELECT DISTINCT queries, which are
much more similar to the ones for schemas without clustering key.

The solution is to set _has_clustering_keys to false in case of SELECT
DISTINCT queries regardless of the schema which will make pager
correctly expect each partition to return at most one row.

Fixes #1822.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 11:06:01 +01:00
Glauber Costa
93386bcec7 histograms: do not use latency_in_nano
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nano just so we may need
to convert it back if the histogram needs another unit.

This patch will keep everything as a duration until last moment, and
then we'll convert when needed.

This was suggested by Amnon.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
2016-11-14 18:01:43 +02:00
Nadav Har'El
c5254b6502 repair: fix undefined variable
If the "trace" parameter of the repair was not given, we will use the
"trace" variable without setting it. We need to set a default value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1479136239-14204-1-git-send-email-nyh@scylladb.com>
2016-11-14 17:16:19 +02:00
Raphael S. Carvalho
e86de40b49 compaction_manager: inform about compaction cancelled by shutdown
After some changes in compaction manager, user no longer is informed
that compaction was cancelled in event of shutdown. That's because
we only ignore ready future when compaction manager was asked to
stop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <02ca29b5a93fe3a558896598f325b0dce069e82c.1478277317.git.raphaelsc@scylladb.com>
2016-11-14 16:37:33 +02:00
Piotr Jastrzebski
4fe989d58e Cleanup sstables::mutation_reader::impl
Pointer to sstable seems unnecessary.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a45e8853af2b5f896ec44144fbc26d3325a5ec0c.1479123740.git.piotr@scylladb.com>
2016-11-14 11:52:52 +00:00
Avi Kivity
14c1b17105 storage_service: fix construct_range_to_endpoint_map with semi-infinite range
After the conversion to nonwrapping ranges, construct_range_to_endpoint_map()
may be called with semi-infinite token ranges, but it does not expect this,
calling nonwrapping_range::end()->value() unconditionally.

Fix by checking whether this is a semi-infinite range on the right, and
replace ->value() by maximum_token() instead.
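
The check can be sketched as follows. The token alias and the range struct
are simplified stand-ins for the real types; only maximum_token() is a
name taken from this message.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

using token = std::int64_t;  // stand-in for the real token type

// Stand-in for a nonwrapping range: a missing bound means "unbounded".
struct token_range {
    std::optional<token> start;
    std::optional<token> end;
};

inline token maximum_token() { return std::numeric_limits<token>::max(); }

// Use the explicit end bound when present, otherwise the maximum token.
inline token end_or_maximum(const token_range& r) {
    return r.end.value_or(maximum_token());
}
```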

Fixes `nodetool describering` (once more).
Message-Id: <1478983010-29630-1-git-send-email-avi@scylladb.com>
2016-11-14 11:39:48 +01:00
Raphael S. Carvalho
9a9f0d3a0f main: fix exception handling when initializing data or commitlog dirs
Exception handling was broken because after io checker, storage_io_error
exception is wrapped around system error exceptions. Also the message
when handling exception wasn't precise enough for all cases. For example,
lack of permission to write to existing data directory.

Fixes #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>
2016-11-14 12:34:10 +02:00
Takuya ASADA
d571123afd dist/common/scripts/scylla_sysconfig_setup: stop using 'bc' command to generate cpuset parameter, use python script instead
We get an error from the bc command when we run the script on >34 ncpus.
To prevent the issue, add a python script to generate the cpuset parameter.

Fixes #1824

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478887624-12737-1-git-send-email-syuu@scylladb.com>
2016-11-14 11:45:23 +02:00
Duarte Nunes
66f6a367a4 ring_position_range_sharder: Avoid copying eagerly
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161104115632.15974-1-duarte@scylladb.com>
2016-11-13 11:42:23 +02:00
Avi Kivity
bf20aa722b Merge "Fixes for histogram and moving average calculations" from Glauber
"JMX metrics were found to be either not showing, or showing absurd
values.  Turns out there were multiple things wrong with them. The
patches were sent separately but conflict with one another. This series
is a collection of the patches needed to fix the issues we saw.

Fixes #1832, #1836, #1837"
2016-11-13 11:16:32 +02:00
Avi Kivity
2670e46f3e storage_service: deinline most methods
Most inline methods in storage_service are too large to be inlined, and
just increase compile time.  De-inline them.
2016-11-12 21:12:28 +02:00
Glauber Costa
608d825790 histogram: fix reporting units
We are tracking latencies in nanoseconds, but almost everywhere else
they are reported in microseconds. Instead of just converting, this
patch tries to be a bit more future proof and embed the unit into the
type - and we then default to microseconds.

I have verified that the JMX measures now report sane values for both
the storage proxy and the column family. nodetool cfhistograms still
works fine. That one is reported in nanoseconds, but through the
estimated_histogram, not ihistogram.

Fixes #1836

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-11 11:36:56 -05:00
Glauber Costa
1342d044eb moving averages: change metrics calculation
We have recently fixed a bug due to which the constructor parameters for
moving average were inverted, leading to the numbers being just plain
wrong. However, the calculation of alpha was already inverted, meaning
it was right by accident and now that's wrong.

With the wrong alpha, the values we see are still correct, but they move
very quickly. The intention of this code is obviously to smooth things
out.

This was found out by Nadav. I have tested and confirmed that the smoothing
factor now works as expected.
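
For reference, a sketch of the textbook exponentially weighted moving
average, which may differ in detail from the code in this patch. The
function names are hypothetical.

```cpp
#include <chrono>
#include <cmath>

using seconds_d = std::chrono::duration<double>;

// Standard EWMA smoothing factor: the shorter the tick relative to the
// averaging interval, the smaller alpha and the smoother the result.
inline double ewma_alpha(seconds_d tick_interval, seconds_d interval) {
    return 1.0 - std::exp(-tick_interval.count() / interval.count());
}

// Move the current rate a fraction alpha of the way towards the sample.
inline double ewma_update(double rate, double sample, double alpha) {
    return rate + alpha * (sample - rate);
}
```

With a 5 second tick and a 1 minute interval, alpha is about 0.08.
Swapping the two arguments gives an alpha close to 1, so the average
follows every sample almost immediately, which matches the "moves very
quickly" symptom.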

Fixes  #1837

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-10 22:33:34 -05:00
Amnon Heiman
a977ea85e1 histogram: moving_average and total rate should be calculated in seconds
The moving average and the total average should be calculated in seconds
and not nanoseconds.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-11-10 22:32:53 -05:00
Glauber Costa
d3f11fbabf histogram: moving averages: fix inverted parameters
moving_averages constructor is defined like this:

    moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)

But when it is time to initialize them, we do this:

	... {tick_interval(), std::chrono::minutes(1)} ...

As it can be seen, the interval and tick interval are inverted. This
leads to the metrics being assigned bogus values.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
2016-11-10 11:28:51 -08:00
Paweł Dziepak
f16d6f9c40 partition_version: make sure that snapshot is destroyed under LSA
Snapshot destructor may free some objects managed by the LSA. That's why
partition_snapshot_reader destructor explicitly destroys the snapshot it
uses. However, it was possible that exception thrown by _read_section
prevented that from happening, making the snapshot destroyed implicitly
without the current allocator set to LSA.

Refs #1831.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>
2016-11-10 13:13:10 +01:00
Gleb Natapov
27e041606b fix LOCAL_ONE printout
Message-Id: <20161109125307.GH7766@scylladb.com>
2016-11-09 12:53:55 +00:00
Duarte Nunes
e680587b8a sstable_test: Be explicit about uncompressed tables
After 7c28ed, the schemas defined in the test became compressed by
default. This patch changes the test so that it is explicit about
which schemas shouldn't define a compressor.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>
2016-11-09 11:21:59 +02:00
Pekka Enberg
b3dea313dd Merge "API changes for Cassandra 3.x migration" from Calle
"Mostly small changes/additions to the API calls to match Cv3
 requirements/semantics, i.e. updated scylla-jmx can implement required
 nodetool etc calls in a working fashion."
2016-11-09 10:30:32 +02:00
Duarte Nunes
e33c02aa60 cql3: Disable compression on empty properties
The CQL 3.1 documentation specifies that for disabling compression,
users should use an empty string:

ALTER TABLE mytable WITH COMPRESSION = {'sstable_compression': ''};

However, Cassandra also accepts the absence of the sstable_compression
option to disable compression. The patch 7c28ed prevented this behavior
in Scylla, which this patch aims to fix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478639499-4183-1-git-send-email-duarte@scylladb.com>
2016-11-09 10:03:59 +02:00
Gleb Natapov
93f068bd44 storage_proxy: fix speculation target selection logic
Current speculation target selection logic has several bugs in multi-dc
setup. It may select a non local target for CL=LOCAL and it may select
more than one target to speculate, one of which is non local.

Examples:

1. Two datacenters: DC1 RF 2, DC2 RF 2  and read with LOCAL_QUORUM.

In this scenario db::filter_for_query() will return both replicas from
the local DC and the speculation target selection logic will pick one more,
which will be in a different DC.

2. Two datacenters: DC1 RF 2, DC2 RF 2  and read with LOCAL_ONE + RRD.DC_LOCAL

In this scenario db::filter_for_query() will return all nodes in the local DC
and there will already be enough nodes to speculate, but the current logic
will add one node from a non-local DC as a speculation target.

The patch below fixes both of those scenarios.

Message-Id: <20161103154637.GS7766@scylladb.com>
2016-11-08 18:32:47 +01:00
Paweł Dziepak
a8308e2a8d row_cache: dummy entry does not count as partition
Since continuity flag introduction row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
2016-11-08 13:54:44 +01:00
Piotr Jastrzebski
50b41f7d1d Fix row_cache_test
partition_range passed to row_cache::make_reader
has to be kept alive as long as the resulting reader
is used.

Otherwise weird things start to happen.

This used to work purely by luck.
When I started changing the row_cache implementation
I ran into very weird behaviors in these tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>
2016-11-08 13:28:53 +01:00
Calle Wilund
473326d49a api/column_family: Make mean row size return integral
As (at least) per C3, these metrics are integral in origin. Adapt.
(Other option would be to translate in jmx).
2016-11-08 12:22:04 +00:00
Calle Wilund
bd646a6755 repair (api): Add option handling (sort of) for nodetool default options 2016-11-08 12:22:04 +00:00
Calle Wilund
0181fc8159 api::cache_service: Add (dummy) calls for key&counter metrics 2016-11-08 12:22:04 +00:00
Calle Wilund
5eb54f9bc4 api::storage_service: c3 compat - make query keyspaces a trinary choice
all, user or non-local strategy ones.
2016-11-08 12:22:04 +00:00
Calle Wilund
3b7a7dd383 api::failure_detector: c3 compat - add endpoint phi value query 2016-11-08 12:22:04 +00:00
Calle Wilund
218df55349 failure_detector: add accessor and api shortcut for arrival samples 2016-11-08 12:22:04 +00:00
Calle Wilund
f9836cd23b api::endpoint_snitch: c3 compat - allow dc/rack query for broadcast 2016-11-08 12:22:04 +00:00
Calle Wilund
54ba06a8bf api::column_family: Add calls/parameters for c3 compatibility 2016-11-08 12:22:04 +00:00
Amnon Heiman
c8082ccadb API: fix a typo in storage_proxy
This patch fixes a typo in the URL definition, which caused the jmx
not to find the metric.

Fixes #1821

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1478563869-20504-1-git-send-email-amnon@scylladb.com>
2016-11-08 11:09:21 +02:00
Amos Kong
95fe88c1d3 scripts/scylla_current_repo: use HTTP to access downloads.scylladb.com
Https isn't available for downloads.scylladb.com, or we can access
it by https://s3.amazonaws.com/downloads.scylladb.com/...

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <d4b65e1724bbeb76c928790d5d3e95b91ee9db79.1478153034.git.amos@scylladb.com>
2016-11-08 11:03:50 +02:00
Avi Kivity
767cfb4fe9 storage_service: fix range wrapping in describe_ring even more
Commit 8fca1887c2 ("storage_service: fix range wrapping in
describe_ring") fixed incorrect range wrapping code for describe_ring,
but fails when the number of endpoints for a token is greater than one,
because the endpoints are stored in an unordered vector.

Fix by comparing the endpoints in a way that ignores their order.
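
One simple way to compare two endpoint lists while ignoring order is to
sort copies of both. This is a sketch with endpoints modeled as strings;
the function name is hypothetical and the patch may use a different
technique.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Takes both vectors by value so the callers' containers stay untouched.
inline bool same_endpoints(std::vector<std::string> a, std::vector<std::string> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}
```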
Message-Id: <1478460826-15923-1-git-send-email-avi@scylladb.com>
2016-11-07 16:18:20 +01:00
Calle Wilund
11baf37ab5 commitlog: Prevent exceptions in stream::produce from being set twice
Fixes #1775
stream lacks a check "is_open", which is a bummer. We have to both
prevent exception propagation and add a flag of our own to make sure
exceptions in producer code reaches consumer, and does not simply
get lost in the reactor.
Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>
2016-11-07 11:41:33 +01:00
Tomasz Grabiec
e6cc0a2e10 Merge branch '1766/v1' from duarten/scylla.git
This patchset adds missing properties to the create_view_statement,
such as whether the view is compact or the order of its clustering
columns.

Fixes #1766
2016-11-07 10:44:24 +01:00
Takuya ASADA
0f1ba1a3bb dist/redhat: remove unused dependencies
Seems like we mistakenly added unneeded packages for BuildRequires when
we created .spec file, so remove them.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478504761-15067-1-git-send-email-syuu@scylladb.com>
2016-11-07 09:48:50 +02:00
Paweł Dziepak
985d2f6d4a Merge "Remove quadratic behavior from atomic sstable deletion" from Avi
"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion.  This
causes high cpu utilization during startup.

Change the code to avoid quadratic complexity, and add some unit tests.

See #1812."
2016-11-04 15:48:04 +00:00
Avi Kivity
f75aceabc5 sstables: add unit tests for atomic deletion
We simulate shards deleting sstables, but this is all happening on a single
core, and no sstables are harmed during test execution.
2016-11-04 15:48:43 +02:00
Avi Kivity
f10b9906d8 sstables: move atomic deletion code to its own files
This will simplify unit testing.  We move generic code that
depends only on seastar, so compile time should not increase too much.
2016-11-04 15:47:35 +02:00
Avi Kivity
9e85653c33 sstables: make atomic_deletion_manager more abstract
Make the shard count and method of deleting sstables abstract, in order
not to require all that machinery for unit tests.
2016-11-04 15:44:09 +02:00
Avi Kivity
e527da1e3c sstables: wrap atomic deletion code in a class
This makes it easier to abstract and unit-test.
2016-11-04 15:44:07 +02:00
Avi Kivity
a05837936a sstables: remove quadratic behavior from atomic sstable deletions
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy.  This guarantees
that any exception does not change the data, since the assignment does
not require allocation.

However, it does result in quadratic behavior.  During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.

Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.

Fixes #1812 (the quadratic behavior part).
2016-11-04 15:42:44 +02:00
Avi Kivity
8fca1887c2 storage_service: fix range wrapping in describe_ring
describe_ring() tries to re-wrap the ranges, but fails because the ranges
are not sorted.  Adjust the code not to rely on sorting.
Message-Id: <1478198630-27483-1-git-send-email-avi@scylladb.com>
2016-11-04 10:48:14 +00:00
Paweł Dziepak
8afd9e52c7 Merge "Process range queries sequentially on shards" from Avi
"Currently, partition range queries are processed in parallel on all
shards. This is inefficient because we are likely to drop the results
from all but one shard, assuming a well-populated column family.  We
are multiplying our work by a factor of smp::count.

While this is worthwhile in its own right, it is really an excuse to
sneak in the range/shard generator (patch 5), which is preliminary for
a new sharding algorithm, dividing tokens among shards based on the
middle-significant bits rather than the most-significant bits (which
alias with vnodes)

Fixes #1573."
2016-11-04 09:58:04 +00:00
Tomasz Grabiec
c1a7e2090e Revert "database: change find_column_families signature so it returns a lw_shared_ptr"
This reverts commit f3528ede65.
2016-11-04 10:48:21 +01:00
Tomasz Grabiec
3b5ccda70e Revert "database: refactor code so apply_in_memory() is called only once"
This reverts commit 3f825f593d.
2016-11-04 10:48:18 +01:00
Tomasz Grabiec
6366eb5cf8 Revert "correctly calculate latencies for writes"
This reverts commit a382f10fc4.
2016-11-04 10:48:02 +01:00
Tomasz Grabiec
a5ee87611a Revert "database: when querying, move latency counter instead of copying"
This reverts commit 8840a5a593.
2016-11-04 10:47:58 +01:00
Tomasz Grabiec
f3c1ff78e6 Merge branch 'cql_read_write_counters-v4' from seastar-dev.git
New CQL counters from Vlad.
2016-11-04 09:19:07 +01:00
Avi Kivity
b3299d5bc3 storage_proxy: simplify range queries
Instead of asking a shard for cmd->partition_limit and cmd->row_limit,
just ask it for the number of partitions and rows still needed to
satisfy the query.  This removes the need to trim the shard's result.
2016-11-03 19:10:20 +02:00
Avi Kivity
a668e575f6 storage_proxy: execute multi-partition query sequentially over shards
Since every shard might cause the row_limit quota to be satisfied, every
shard might be the last one we need.  Hence it is better to process shards
sequentially, stopping if the quota is reached or the range is exhausted.

The original code tried to yield to reduce latency, but this is now
unnecessary, as we're doing a lot less work per iteration (if it becomes
necessary, we should do it on the replica shard, not the coordinating shard).
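
The control flow described above can be sketched as follows. This is a
simplified illustration in Python, not the storage_proxy code: each shard is
modelled as a hypothetical callable that takes a row limit and returns at most
that many rows, and only the row limit is tracked.

```python
def query_sequentially(shards, row_limit):
    """Query shards in ring order, asking each only for the rows still needed."""
    rows = []
    for query_shard in shards:
        remaining = row_limit - len(rows)
        if remaining <= 0:
            break  # quota satisfied; later shards are never contacted
        # The shard is asked for `remaining` rows, so its result needs no trimming.
        rows.extend(query_shard(remaining))
    return rows
```

Because each shard receives only the outstanding quota, the coordinator never
has to discard rows it already fetched.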
2016-11-03 19:10:20 +02:00
Avi Kivity
1d77e3a03a partitioner: add unit tests for token_for_next_shard()
i_partitioner::token_for_next_shard() is an inverse for
i_partitioner::shard_of(), test that this is so.
2016-11-03 19:10:20 +02:00
Avi Kivity
7202b94183 dht: introduce a sharder for vectors of partition ranges
Building on the single-range sharder, add a sharder for vectors of
partition ranges.  This helps with wrapped ranges, which are translated
into a vector containing two ranges.
2016-11-03 19:10:20 +02:00
Avi Kivity
43a2380899 dht: add a generator for shard/range pairs
Divides a ring_position range into a sequence of shard/range pairs.  This
allows sequential iteration over shards in ring order.

The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page.  With the generator, we can switch to sequential
execution.
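
A minimal sketch of such a generator, assuming unsigned 64-bit tokens and
static sharding on the most-significant bits. The names `TOKEN_BITS`,
`shard_of`, and `shard_ranges` are illustrative and do not mirror the dht
interfaces; ranges are modelled as half-open integer intervals.

```python
TOKEN_BITS = 64  # assumption: tokens are unsigned 64-bit integers

def shard_of(token, shard_count):
    """Map a token to a shard by scaling its position in token space."""
    return (token * shard_count) >> TOKEN_BITS

def shard_ranges(start, end, shard_count):
    """Yield (shard, (lo, hi)) pairs covering [start, end) in ring order."""
    lo = start
    while lo < end:
        shard = shard_of(lo, shard_count)
        # First token owned by the following shard (ceiling division).
        boundary = -((-((shard + 1) << TOKEN_BITS)) // shard_count)
        hi = min(end, boundary)
        yield shard, (lo, hi)
        lo = hi
```

A consumer can stop iterating as soon as it has enough data, which is what
makes sequential execution possible.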
2016-11-03 19:10:17 +02:00
Avi Kivity
1f88d103a8 partitioner: add i_partitioner::token_for_next_shard()
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.

To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.

It is a sort-of inverse to shard_of(), in that

  shard_of(token_for_next_shard(t)) == shard_of(t) + 1
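
The relation above can be checked on a toy partitioner. This is a sketch
under stated assumptions, not the i_partitioner implementation: tokens are
unsigned 64-bit integers, shards are assigned from the most-significant bits,
and the last shard has no next boundary (modelled here by returning None).

```python
TOKEN_BITS = 64  # assumption: tokens are unsigned 64-bit integers

def shard_of(token, shard_count):
    """Map a token to a shard by scaling its position in token space."""
    return (token * shard_count) >> TOKEN_BITS

def token_for_next_shard(token, shard_count):
    """Return the first token owned by the next shard, or None on the last shard."""
    next_shard = shard_of(token, shard_count) + 1
    if next_shard == shard_count:
        return None  # no further shard boundary in this toy token space
    # Smallest t such that shard_of(t) == next_shard (ceiling division).
    return -((-(next_shard << TOKEN_BITS)) // shard_count)
```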
2016-11-03 19:09:23 +02:00
Vlad Zolotarov
6c15dd967a cql3::query_processor: make the collectd metrics registration nicer
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
36cc351ae1 cql3::query_processor: add a counter for BATCH CQL statements
- Add a "batches" member to cql_stats.
   - Update it where appropriate.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
6e1d27bed1 cql3::query_processor: add a counter for a number of CQL modification requests ("writes")
- Add inserts, updates, and deletes members to cql_stats.
   - Store cql_stats& in a modification_statement and increment the corresponding counter according to the value of a "type" field.
   - Store cql_stats& in a batch_statement and increment the statistics for each BATCH member.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:15 -04:00
Vlad Zolotarov
fa4e1db0cb cql: add a counter for CQL read (SELECT) requests
- Add a "reads" counter to a cql3::cql_stats struct.
   - Store a reference for a query_processor::_cql_stats in the select_statement object.
   - Increment a "reads" counter where needed.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Vlad Zolotarov
7606588267 cql3::query_processor: add cql_stats
- Add cql_stats member.
   - Pass it to cql3::raw::parsed_statement::prepare() virtual method.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Glauber Costa
8840a5a593 database: when querying, move latency counter instead of copying
It is comprised of two time points. Let's move it instead of copying it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c7c155c77780e188bfbe05881c81ce86456016d5.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
a382f10fc4 correctly calculate latencies for writes
Right now we are calculating latencies only when we are about to add an
item to the memtable.

That's incorrect and misleading, for two reasons. First, it leaves the
commitlog latencies out. Second, it is done after the memtable wall
effect is applied, which means we are counting throttle time neither
in the memtables nor in the commitlog.

To do that, we'll start the latency_counter object as soon as possible
and move it all the way to apply_in_memory(). That should span the
entire write operation.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
3f825f593d database: refactor code so apply_in_memory() is called only once
There are two variants of apply_in_memory() being called in do_apply():
with and without the commitlog. The main differences are that when the
commitlog is involved, we need to wait for its future to complete before
moving to apply_in_memory. That can easily be factored out by providing
an always-ready future if we don't have the commitlog enabled, and
waiting on that.

The second, is that the commitlog version can cause apply_in_memory to
generate an exception if there is replay position reordering. However,
there is no harm in appending the exception handler to both versions. In
one of them it's an impossible exception, but that's fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
f3528ede65 database: change find_column_families signature so it returns a lw_shared_ptr
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.

If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Avi Kivity
6c45b0bae8 partitioner: make comparators public
The public comparison operators depend on global_partitioner(), and are
therefore less useful for tests.
2016-11-03 11:27:40 +02:00
Avi Kivity
6320181b97 partitioner: const correctness for comparators 2016-11-03 11:27:40 +02:00
Avi Kivity
470826d127 partitioner: change partitioners to have shard counts independent from smp::count
Useful for testing.
2016-11-03 11:27:40 +02:00
Avi Kivity
75706c0a26 size_estimates_recorder: sort token range before rewrapping it
Since size estimates are stored as wrapped ranges, we call compat::wrap()
to convert from the now-standard unwrapped ranges back to wrapped ranges.
However, compat::wrap() relies on the ranges being in sorted order,
but our input is not.  This leads to a crash as we find an unexpected
empty token in the middle of the vector.

Sort it so compat::wrap() works as expected.

Fixes #1804.
Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>
2016-11-03 09:43:41 +01:00
Avi Kivity
a35136533d Convert ring_position and token ranges to be nonwrapping
Wrapping ranges are a pain, so we are moving wrap handling to the edges.

Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.

Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
2016-11-02 21:04:11 +02:00
Takuya ASADA
8c55c99353 dist/common/scripts/scylla_io_setup: pass --smp option to iotune command
We ignored the --smp option taken from io.conf because iotune didn't support
it, but now that it does, we can pass it.
(We need to pass it because we need to measure I/O performance under the same
conditions as Scylla.)

Fixes #1768

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478082591-27205-1-git-send-email-syuu@scylladb.com>
2016-11-02 12:49:50 +02:00
Raphael S. Carvalho
53b7b7def3 sstables: handle unrecognized sstable component
As in C*, unrecognized sstable components should be ignored when
loading a sstable. At the moment, Scylla fails to do so and will
not boot as a result. In addition, unknown components should be
remembered when moving a sstable or changing its generation.

Fixes #1780.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b7af0c28e5b574fd577a7a1d28fb006ac197aa0a.1478025930.git.raphaelsc@scylladb.com>
2016-11-02 12:44:53 +02:00
Avi Kivity
72c2982260 dist: require scylla-boost-static for EL RPM build 2016-11-01 18:55:55 +02:00
Pekka Enberg
e1e8ca2788 cql3: Fix selecting same column multiple times
Under the hood, the selectable::add_and_get_index() function
deliberately filters out duplicate columns. This causes
simple_selector::get_output_row() to return a row with all duplicate
columns filtered out, which triggers an assertion because of row
mismatch with metadata (which contains the duplicate columns).

The fix is rather simple: just make selection::from_selectors() use
selection_with_processing if the number of selectors and column
definitions doesn't match -- like Apache Cassandra does.
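
The mismatch can be reproduced with a small model. This is an illustrative
sketch only: the two functions below are simplified stand-ins named after the
idea in the text, not the cql3 code.

```python
def add_and_get_index(column, columns):
    """Return the index of `column`, appending it only if not already present."""
    if column not in columns:
        columns.append(column)
    return columns.index(column)

def needs_processing(selectors, columns):
    """True when the selector count differs from the deduplicated column count."""
    return len(selectors) != len(columns)
```

For `SELECT a, b, a`, three selectors map onto two distinct columns, so the
counts differ and the processing path has to rebuild the duplicated column in
the output row.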

Fixes #1367
Message-Id: <1477989740-6485-1-git-send-email-penberg@scylladb.com>
2016-11-01 09:09:01 +00:00
Pekka Enberg
d46ed53e9e scripts: add update-version
This patch adds an `update-version` script for updating the Scylla
version number in `SCYLLA-VERSION-GEN` file and committing the change to
git.

Example use:

  $ ./scripts/update-version 1.4.0

which results in the following git commit:

  commit 4599c16d9292d8d9299b40a3e44ef7ee80e3c3cf
  Author: Pekka Enberg <penberg@scylladb.com>
  Date:   Fri Oct 28 10:24:52 2016 +0300

      release: prepare for 1.4.0

  diff --git a/SCYLLA-VERSION-GEN b/SCYLLA-VERSION-GEN
  index 753c982..eba2da4 100755
  --- a/SCYLLA-VERSION-GEN
  +++ b/SCYLLA-VERSION-GEN
  @@ -1,6 +1,6 @@
   #!/bin/sh

  -VERSION=666.development
  +VERSION=1.4.0

   if test -f version
   then

Message-Id: <1477639560-10896-1-git-send-email-penberg@scylladb.com>
2016-10-30 12:43:41 +02:00
Avi Kivity
feb8faf70b Merge "make refresh resilient to permission denied error" from Raphael
Fixes #1709.

* 'refresh-resilient-v3' of github.com:raphaelsc/scylla:
  db: make refresh resilient to permission denied error
  db: make it possible to use custom error handler with io checker
  sstables: remove duplicated declaration of remove_by_toc_name
2016-10-30 10:28:09 +02:00
Takuya ASADA
68d9f5212c dist/ubuntu/dep/thrift.diff: add missing build time dependency
We need libcrypto header to build thrift, so add it.

Fixes #1798

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477676716-5726-1-git-send-email-syuu@scylladb.com>
2016-10-29 17:49:30 +03:00
Avi Kivity
71532d8cd5 Merge seastar upstream
* seastar 05f6c5c...47e1821 (1):
  > rpc: Avoid using zero-copy interface of output_stream (Fixes #1786)
2016-10-28 14:09:16 +03:00
Avi Kivity
e03ca06431 dist: fix rpm build
--static-boost is supposed to be an input to ./configure.py, not ninja.  Move
it there.
2016-10-28 08:42:26 +03:00
Pekka Enberg
b54870764f auth: Fix resource level handling
We use the `data_resource` class in the CQL parser, which lets users refer
to a table resource without specifying a keyspace. This asserts out in
get_level() for no good reason as we already know the intended level
based on the constructor. Therefore, change `data_resource` to track the
level like upstream Cassandra does and use that.

Fixes #1790

Message-Id: <1477599169-2945-1-git-send-email-penberg@scylladb.com>
2016-10-27 23:37:26 +03:00
Glauber Costa
ef3c7ab38e auth: always convert string to upper case before comparing
We store all auth perm strings in upper case, but the user might very
well pass this in lower case.

We could use a standard key comparator / hash here, but since the
strings tend to be small, the new sstring will likely be allocated on
the stack here and this approach yields significantly less code.
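
A minimal sketch of the approach, in Python for brevity. The permission set
and the function name are illustrative assumptions, not the auth code.

```python
# Illustrative set; the authoritative list lives in the auth code.
KNOWN_PERMISSIONS = {"CREATE", "ALTER", "DROP", "SELECT", "MODIFY", "AUTHORIZE"}

def parse_permission(name):
    """Normalize user input to upper case before looking it up."""
    key = name.upper()
    if key not in KNOWN_PERMISSIONS:
        raise ValueError(f"unknown permission: {name}")
    return key
```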

Fixes #1791.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <51df92451e6e0a6325a005c19c95eaa55270da61.1477594199.git.glauber@scylladb.com>
2016-10-27 22:08:57 +03:00
Raphael S. Carvalho
d11e839520 db: make refresh resilient to permission denied error
A user may forget to set the permissions of new sstables in the upload dir
before refreshing them, and that will result in a shutdown.
io_checker is now able to work with a custom handler, so all we
have to do is to whitelist EACCES.

Fixes #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 16:50:40 -02:00
Raphael S. Carvalho
a3e065da9b db: make it possible to use custom error handler with io checker
By default, io checker will cause Scylla to shut down if it finds
specific system errors. Right now, io checker isn't flexible
enough to allow a specialized handler. For example, we don't want
Scylla to shut down if there's a permission problem when
uploading new files from upload dir. This desired flexibility is
made possible here by allowing a handler parameter to io check
functions and also changing existing code to take advantage of it.
That's a step towards fixing #1709.
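
The idea can be sketched as follows. This is a hedged illustration, not the
io checker API: `io_check`, `default_handler`, and `refresh_handler` are
hypothetical names, the default handler treats every OSError as fatal for
brevity, and SystemExit stands in for shutting the node down.

```python
import errno

def default_handler(exc):
    """Treat the error as fatal (stand-in for shutting the node down)."""
    raise SystemExit(f"fatal I/O error: {exc}")

def refresh_handler(exc):
    """Custom handler: tolerate permission errors, defer everything else."""
    if exc.errno != errno.EACCES:
        default_handler(exc)

def io_check(func, *args, handler=default_handler):
    """Run an I/O function; let the handler decide whether an error is fatal."""
    try:
        return func(*args)
    except OSError as exc:
        handler(exc)
        raise  # non-fatal errors still propagate to the caller
```

With `refresh_handler`, a permission error surfaces to the caller as an
ordinary exception instead of stopping the process.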

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 15:54:21 -02:00
Takuya ASADA
a1b7e76d43 dist/ubuntu: support 16.10
Add 16.10 to 'supported_release'

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477585454-2115-1-git-send-email-syuu@scylladb.com>
2016-10-27 19:26:14 +03:00
Takuya ASADA
36e831a106 dist/common/scripts/scylla_bootparam_setup: support EC2 paravirtual instances
EC2 paravirtual instances use pv-grub, which refers to /boot/grub/menu.lst (grub0.9x config file) instead of the grub2 config file.
So add boot parameters to /boot/grub/menu.lst when the file exists and the instance is on EC2.

Fixes #1598

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472056875-17512-1-git-send-email-syuu@scylladb.com>
2016-10-27 18:55:05 +03:00
Avi Kivity
402a3f1c9f Merge seastar upstream
* seastar 9bed76a...05f6c5c (5):
  > reactor: improve task quota timer resolution
  > Update dpdk submodule to local-patches-20161027 tag
  > tests: wire up json_formatter_test
  > json_formatter_test: Add rudimentary json formatter test
  > scripts/posix_net_conf.sh: detect IRQs of virtio-net and xen_netfront correctly
2016-10-27 18:19:40 +03:00
Avi Kivity
e995f5a3a7 dist: statically link with boost on RHEL
Reduces runtime dependencies on Scylla-provided third-party boost packages.

Message-Id: <1477552490-28961-1-git-send-email-avi@scylladb.com>
2016-10-27 12:35:12 +03:00
Avi Kivity
76628a7b0b dist: make wget quieter
wget is often used from scripts recording to logs; as it emits a log
line every second, the logs are huge and unreadable.  Make it quieter.

Message-Id: <1477558534-32718-1-git-send-email-avi@scylladb.com>
2016-10-27 12:11:26 +03:00
Avi Kivity
72d78ffa7e Merge "Cache fixes" from Paweł
"5ff699e09fcbd62611e78b9de601f6c8636ab2f0 ("row_cache: rework cache to
use fast forwarding reader") brought some significant changes to the
row cache implementation. Unfortunately, "significant changes" often
translates to "more bugs" and this time was no different.

This series contains fixes for the problems introduced in that rework
and makes failing dtest
bootstrap_test.py:TestBootstrap.local_quorum_bootstrap_test
pass again."

* 'pdziepak/cache-fixes/v1' of github.com:cloudius-systems/seastar-dev:
  row_cache: avoid dereferencing invalid iterator
  row_cache: set _first_element flag correctly
  row_cache: fix clearing continuity flag at eviction
2016-10-27 11:44:15 +03:00
Takuya ASADA
5cb7dc5dc3 dist/ubuntu/dep: update thrift to 0.9.3
To make thrift compilable on gcc-6.2, we need to upgrade to the latest version of
thrift.
This is required to support Ubuntu 16.10.

Fixes #1784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477517671-18067-1-git-send-email-syuu@scylladb.com>
2016-10-27 10:22:06 +03:00
Paweł Dziepak
a7224ae46e row_cache: avoid dereferencing invalid iterator
Conditions in row_cache::do_find_or_create_entry() make it possible that
std::prev(it) is going to be dereferenced even if it is a begin
iterator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:24:23 +01:00
Paweł Dziepak
654f651e0c row_cache: set _first_element flag correctly
If the continuity flag was set for the first element, the _first_element flag
would not be cleared. This shouldn't cause any correctness problems but
properly setting the flag allows us to avoid some unnecessary key
comparisons.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:07:24 +01:00
Paweł Dziepak
567ff96f2a row_cache: fix clearing continuity flag at eviction
In the original implementation the continuity flag indicated that the cache has
full information about the range between the current partition and the
one following it, hence when evicting an entry the one preceding it
had to have its continuity flag cleared.

This was changed, however, and now the continuity flag tells whether the
cache is continuous between the current element and the one before it.
This means that eviction code needs to clear the flag for the entry
directly following the evicted one.
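
A toy model of the corrected eviction rule, assuming entries are kept in key
order and each carries a `continuous` flag that describes the gap between it
and its predecessor. The list-of-dicts representation is illustrative only.

```python
def evict(entries, index):
    """Remove entries[index] and clear the continuity flag of its successor."""
    del entries[index]
    # The former successor now sits at `index`; the range before it is no
    # longer fully covered by the cache.
    if index < len(entries):
        entries[index]["continuous"] = False
```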

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 14:58:20 +01:00
Raphael S. Carvalho
bc2d351c25 sstables: remove duplicated declaration of remove_by_toc_name
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-26 11:21:27 -02:00
Takuya ASADA
7617adadf4 dist/ami/files/.bash_profile: fix confusing message when running AMI on unsupported instance type
To describe which instance types are supported, show the documentation URL instead of
a confusing message.

Fixes #1646

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477473336-25373-1-git-send-email-syuu@scylladb.com>
2016-10-26 12:48:51 +03:00
Avi Kivity
7faf2eed2f build: support for linking statically with boost
Remove assumptions in the build system about dynamically linked boost unit
tests.  Includes seastar update which would have otherwise broken the
build.
2016-10-26 08:51:21 +03:00
Piotr Jastrzebski
27726cecff Clean up position_in_partition.
Introduce position_in_partition_view and use it in
position() method in mutation_fragment, range_tombstone,
static_row and clustering_row.
Clean up comparators in position_in_partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c65293c71a6aa23cf930ed317fb63df1fdc34fd1.1477399763.git.piotr@scylladb.com>
2016-10-25 15:13:20 +01:00
Tomasz Grabiec
cbaae2bf7f Merge seastar upstream
* seastar e18205b...3777135 (1):
  > rpc: Do not close client connection on error response for a timed out request

Fixes #1778
2016-10-25 13:59:41 +02:00
Raphael S. Carvalho
975ce62dbc sstables: do not swallow exception when reading TOC
That caused a problem when refreshing an sstable with bad permissions.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <48e5322c53234209e55da05c64c99b8ec4e190a3.1477372974.git.raphaelsc@scylladb.com>
2016-10-25 12:21:32 +03:00
Avi Kivity
ddd4dbf928 Update scylla-ami submodule
* dist/ami/files/scylla-ami e1e3919...61ff5c6 (1):
  > scylla_ami_setup: run posix_net_conf.sh when NCPUS < 8
2016-10-25 11:18:58 +03:00
Avi Kivity
4b55a687b6 Merge seastar upstream
* seastar 98b5a2d...e18205b (1):
  > json::formatter: Add formatters for maps + rudimentary test
2016-10-25 11:17:29 +03:00
Avi Kivity
e8edaaf6a4 Merge seastar upstream
* seastar 69acec1...98b5a2d (9):
  > rpc: Silence warning about ignored failed future
  > future: prioritise continuations that can run immediately
  > iotune: relax aio restrictions
  > build: support for static linking with boost
  > rpc: Fix crash during connection teardown
  > rpc: Move _connected flag to protocol::connection
  > rpc test: fail test if exception is thrown during test execution
  > rpc: do not assume underling semaphore type
  > rpc: fix default resource limit
2016-10-25 11:09:40 +03:00
Avi Kivity
fc8210a875 tests: fix tests with boost 1.60
In boost 1.60, the executable's command-line arguments are expected to
be separated from the boost command-line arguments by '--'.  Detect
this requirement and comply with it.
Message-Id: <1477212424-3831-1-git-send-email-avi@scylladb.com>
2016-10-24 09:36:56 +02:00
Avi Kivity
37f112b610 dist: add python3-yaml to ubuntu dependencies for blocktune 2016-10-23 16:42:13 +03:00
Avi Kivity
7d50d6df9b blocktune: fix syntax error in exception handling 2016-10-23 16:40:00 +03:00
Avi Kivity
e261a380a9 dist: add PyYAML dependency to rpm (for blocktune) 2016-10-23 10:36:29 +03:00
Raphael S. Carvalho
fa308c079c database: fix collectd metrics for clustering key filter
Same instance name was used for exported metrics, which is
definitely wrong. Checked it works properly now via collectd
exporter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>
2016-10-22 09:51:18 +03:00
Glauber Costa
a13c410749 commitlog: cycle based on total size, not on mutation size
We calculate two sizes during the allocation: "size", which is the
in-segment size of this mutation, and "s", which is that plus the
overhead. cycle() must be called with the latter, not the former, as
doing otherwise may lead to buffer overflows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>
2016-10-21 18:57:41 +03:00
Glauber Costa
d9875784a1 commitlog: do not wait on pending operations for batch mode
This was explicitly mentioned in my set as gone in one of the versions.
Somehow it came back in the final version - sorry about that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2a0eba28cd74267d1a1fdcf1aef2901cc74ffc9f.1477059963.git.glauber@scylladb.com>
2016-10-21 17:27:16 +03:00
Vlad Zolotarov
f75a350a8f service::storage_proxy: use global_trace_state_ptr when using invoke_on
When trace_state may migrate to a different shard a global_trace_state_ptr
has to be used.

This patch completes the patch below:

commit 7e180c7bd3
Author: Vlad Zolotarov <vladz@cloudius-systems.com>
Date:   Tue Sep 20 19:09:27 2016 +0300

    tracing: introduce the tracing::global_trace_state_ptr class

Fixes #1770

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1476993537-27388-1-git-send-email-vladz@cloudius-systems.com>
2016-10-21 11:34:13 +03:00
Avi Kivity
e3ae54f0fe Merge "Rework commitlog to avoid timeouts" from Glauber
"This patchset reworks the commitlog logic to better handle conditions in
which we are getting requests faster than the disk can handle. It does
this by building a wall around the commitlog and only allowing
allocations to proceed when we are under the desired memory threshold.

The main advantage of that is that we can now easily set the commitlog
to work at disk speed, more or less allowing a "one byte in for each
byte out" approach instead of depending on the current cycle to finish.
As a result, max latencies are greatly reduced.

Testing Results
===============

To test this, I have run a workload that times out frequently. That
workload uses 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

I have run this with larger sizes as well, and it generally performs
much better than the baseline version. For sizes up to 5MB, I have seen
no timeouts in my setup. After that, I see some timeouts. Buffer
splitting is expected to make this better.

Aside from performance testing, this was also tested with batch and
periodic mode for various requests sizes."
2016-10-20 16:44:39 +03:00
Glauber Costa
d5618c6ace commitlog: add total_operations type for requests_blocked_memory
Current tracker for pending allocations is a queue_size GAUGE.  Add a
total_operations version so we have more insight on what's going on.

It will be called requests_blocked_memory for consistency with other
subsystems that track similar things.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-20 09:25:38 -04:00
Avi Kivity
db2f5e6be1 blocktune: wire up blocktune on startup
Message-Id: <1476357027-15014-3-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
098d02ad1a scylla-blocktune: introduce
scylla-blocktune is a script that parses scylla.yaml and tunes the data file
and commitlog directories it references.

Tuning includes:
 - set the I/O scheduler to noop
 - disable merging
 - tune dependent devices (like RAID members)

Message-Id: <1476357027-15014-2-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
fad34eef6c scylla_raid_setup: don't mess with read-ahead
It doesn't affect O_DIRECT reads, and it's not persistent.

Message-Id: <1476269082-2473-2-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Avi Kivity
a837da06ef scylla_raid_setup: increase chunk size
The current chunk size of 256 gives a 50% probability of a 128k read or
write getting split into two accesses.  This reduces efficiency and
increases latency.

Change the chunk size to 1MB, with a 12% probability of cross-member
access.
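
The percentages follow from a simple estimate, sketched below. It assumes the
sizes are in KB and that the start offset of an access is uniformly
distributed within a chunk; the function name is illustrative.

```python
def split_probability(io_size_kb, chunk_size_kb):
    """Probability that an access of the given size straddles a chunk boundary."""
    return min(1.0, io_size_kb / chunk_size_kb)
```

For a 128k access this gives 128/256 = 0.5 with the old chunk size and
128/1024 = 0.125 with a 1MB chunk, which matches the rounded 12% figure.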

Message-Id: <1476269082-2473-1-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Takuya ASADA
80e3d8286c dist/ami: fix incorrect /etc/fstab entry on CentOS7 base image
There was an incorrect rootfs entry in /etc/fstab:
 /dev/sda1 / xfs defaults,noatime 1 1
This causes a boot error when updating to a new kernel.
(see:
https://github.com/scylladb/scylla/issues/1597#issuecomment-250243187)

So replace the entry with
 UUID=<uuid>  / xfs defaults,noatime 1 1
Also, all recent security updates are applied.

Fixes #1597
Fixes #1707

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475094957-9464-1-git-send-email-syuu@scylladb.com>
2016-10-20 11:48:24 +03:00
Takuya ASADA
5f602752a5 dist/ubuntu: backport g++-5 from Debian 9(stretch) to Debian 8(jessie)
Since Debian 8(jessie) does not provide g++-5, we frequently got compile errors
because we were using an older compiler.
To fix the problem, backport g++-5 from Debian 9(stretch).

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-3-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Takuya ASADA
7d67504b56 dist/ubuntu: use VERSION_ID from /etc/os-release instead of 'lsb_release -r'
On Debian, lsb_release -r returns a version number such as '8.6'.
However, in this script we want to check the major version only.
Therefore we can use VERSION_ID from /etc/os-release, which contains only
the major version number.
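
Reading the field can be sketched as below. This is an illustration in
Python, not the build script itself; `read_version_id` is a hypothetical
helper that takes the file contents as a string.

```python
def read_version_id(os_release_text):
    """Return the VERSION_ID value from os-release contents, or None if absent."""
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "VERSION_ID":
            return value.strip().strip('"')
    return None
```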

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-2-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Avi Kivity
0da2f64cfb Merge seastar upstream
* seastar ccd8649...69acec1 (2):
  > app/iotune: add --smp option
  > rpc: Add missing adjustment of snd_buf::size

Fixes #1767.
Fixes #1768.
2016-10-20 11:16:40 +03:00
Paweł Dziepak
210a390892 tests: add missing sstable for partition skipping test
Commit 7dcd70124a "tests/sstables: add
test for fast forwarding reader" added a test for skipping parts of
sstable. Unfortunately, it did not include the sstables it was trying to
read.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 23:23:49 +01:00
Glauber Costa
1578d7363a commitlog: rework blocking logic
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.

That is obviously something we must do, but the current approach to it
is problematic for two main reasons:

1) It forces the requests that trigger a write to wait on the current
   write to finish. That is excessive; ideally we would wait for one
   particular write to finish, not necessarily the current one. That
   is made worse by the fact that when a write is followed by a flush
   (happens when we move to a new segment), then we must wait for
   *all* writes in that segment to finish.

2) It casts concurrency in terms of writes instead of memory, which
   makes the aforementioned problem a lot worse: if we have very big
   buffers in flight and we must wait for them to finish, that can
   take a long time, often in the order of seconds, causing timeouts.

The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.
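
The accounting idea can be sketched as follows. This is a simplified,
single-threaded model and not the commitlog's request_controller: the class
and method names are illustrative, admission is signalled through a callback,
and a request larger than the limit would wait forever in this toy version.

```python
from collections import deque

class RequestController:
    """Admit allocations while the buffer memory in use stays within a limit."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0
        self.waiting = deque()  # FIFO of (size, on_admit) pairs

    def allocate(self, size, on_admit):
        """Queue a request; it is admitted at once if it fits."""
        self.waiting.append((size, on_admit))
        self._drain()

    def release(self, size):
        """Return memory after a buffer is written, then admit waiters."""
        self.used -= size
        self._drain()

    def _drain(self):
        # Admit waiters in arrival order for as long as they fit.
        while self.waiting and self.used + self.waiting[0][0] <= self.limit:
            size, on_admit = self.waiting.popleft()
            self.used += size
            on_admit()
```

Blocking on bytes rather than on a write count means one large buffer delays
new requests in proportion to its size, not for a whole write cycle.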

This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.

To test this, I have run a workload that times out frequently. That
workload uses 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:56:36 -04:00
Glauber Costa
aec724bbda commitlog: factor out code for checking mutation size
In a subsequent patch, I'll use this code in a different place. To
prepare for that, we move it out as a method. It also fits a lot better
inside the segment manager, so move it there.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
a50996f376 commitlog: calculate segment-independent size of mutations
The goal is to calculate a size that is less than or equal to the
segment-dependent size.

This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"

Extracted here so it sits clearly in a different patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
0b7c9fa17f commitlog: remove _needed_size
It is mostly an optimization, and while it makes sense in this context,
it soon won't, as we'll stop waiting for the current cycle specifically
to finish.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
6214bdeb66 commitlog: move segment_manager constructor outside the class definition
We'll do that so we can, in following patches, use static members from
the segment. Those are not defined at this point.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
299877f432 commitlog: add a counter for pending allocations
We track the amount of pending allocations but we don't really export
it. It will be crucial when we stop tracking pending writes.

This patch exports it through a method instead of the totals structure,
so we can easily change it. Current code probing pending_allocations
(the api code) is also converted to use the public method instead of the
totals struct.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Avi Kivity
07c995ab3d Merge "Fast forward mutation readers" from Paweł
"This patchset enables mutation readers to be fast forwarded to a different
partition range. The main reason for introducing such a feature is range
queries served from cache. If the cache is partially populated in the
requested range the reader will end up with multiple subranges that have
to be read from the sstables. Originally, each of these subranges would
require a new reader to be created, but with fast forwarding we can have
just one sstable reader. This is better since there is a chance that buffers
kept by the reader may be still useful after fast forwarding it.

In this series there are also patches that clean up cache readers in order
to make integration with fast forwarding easier. Namely, continuity flag is
changed to store information about range before the entry which significantly
simplifies the logic.

Fixes #1299."

* 'pdziepak/fast-forward-mutation-readers/v5' of github.com:cloudius-systems/seastar-dev: (24 commits)
  sstables: keep separate stream history for single and range reads
  sstables: drop sstable::{lower, upper}_bound()
  row_cache: rework cache to use fast forwarding reader
  row_cache: put cache entry flags in a struct
  row_cache: add do_find_or_create_entry() to reduce code duplication
  mutation_reader: forward fast_forward_to() calls
  tests/row_cache: add fast_forward_to() to throttled reader
  tests/row_cache: count mutations read from _underlying
  memtable: add support for fast_forward_to()
  drop key readers
  tests/mutation_reader: test fast forwarding combined reader
  database: enable fast forwarding of range_sstable_reader
  combined_mutation_reader: implement fast_forward_to()
  mutation_reader: make combined_reader public
  tests/sstables: add test for fast forwarding reader
  tests: add more helpers to mutation reader assertions
  sstables: enable fast forwarding for range readers
  mutation_reader: introduce fast_forward_to()
  sstables: implement mutation_reader::impl::fast_forward_to()
  sstables: introduce index_reader
  ...
2016-10-19 18:10:44 +03:00
Paweł Dziepak
ab0eeae82d sstables: keep separate stream history for single and range reads
Single partition and partition range reads are expected to behave
considerably differently, so it is worth having them use separate file
stream histories. This also makes reads use a different history for each
sstable, which is also a good thing.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
20bfa1fa52 sstables: drop sstable::{lower, upper}_bound()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ff699e09f row_cache: rework cache to use fast forwarding reader
This uncomfortably large patch overhauls cache range reader so that it
can take advantage of fast forwarding mutation readers.

A significant change in the cache itself is that the continuity flag now
is used to determine whether cache is contiguous between the previous
entry and the current one. This allows for a significant simplification
of the cache code and easier integration with reader fast forwarding.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
18acb0c0e6 row_cache: put cache entry flags in a struct
Flags are easier to manage if they are in a single structure.
In particular, default initialization and move constructors are simpler
and less error prone.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f248e23db5 row_cache: add do_find_or_create_entry() to reduce code duplication
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
bcd374c05d mutation_reader: forward fast_forward_to() calls
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0c24bbe639 tests/row_cache: add fast_forward_to() to throttled reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
69645455f3 tests/row_cache: count mutations read from _underlying
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether
continuity flag is working correctly.

This is not going to work with fast forwarding mutation readers so the
test is switched to count number of mutations (+ end of stream markers)
returned from underlying mutation readers which is much less fragile.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
e14f8027d5 memtable: add support for fast_forward_to()
Fast forwarding of memtable readers is needed only for unit tests which
often use memtables as underlying data source for cache and the cache
readers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers weren't used since introduction of continuity flag to cache
entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ac9babe97 tests/mutation_reader: test fast forwarding combined reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7bebfb851f database: enable fast forwarding of range_sstable_reader
When fast forwarding a reader that combines sstable reader we must also
remember that the set of sstables for the new range may be different
than for the previous one. The reader introduced in this patch makes
sure that we read from correct sstables.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
b7b7b2bd63 combined_mutation_reader: implement fast_forward_to()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2c0cdd55fc mutation_reader: make combined_reader public
We want to be able to fast forward sstable readers. However, just
implementing fast_forward_to() for combined_reader is not enough as the
sstables we are reading from may need to change.

Following patches are going to introduce a combined sstable reader that
derives from combined_reader. To make that possible we first need to
make combined_reader public.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7dcd70124a tests/sstables: add test for fast forwarding reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5534dc2817 tests: add more helpers to mutation reader assertions
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
cf024975fe sstables: enable fast forwarding for range readers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
62c9492d33 mutation_reader: introduce fast_forward_to()
This patch introduces the interface for fast forwarding mutation
readers. The main user of this feature is going to be cache which, while
serving range query, may need to read multiple small ranges from the
sstables to populate itself with the missing entries.

Fast forwarding is an alternative to recreating a reader with a different
range. Its main advantage is that it avoids dropping data that has
already been read.
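
As a rough illustration of the idea (not the actual mutation_reader code),
the sketch below uses a toy reader over a sorted vector of integer keys.
The names toy_reader, key_range and read_ranges are hypothetical; only the
method name fast_forward_to() comes from this patch, and its real signature
is not shown in this message.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical half-open key range [first, second); not Scylla's partition_range.
using key_range = std::pair<int, int>;

// Toy reader over a sorted vector of keys. Illustration only.
class toy_reader {
    const std::vector<int>& _keys;  // stands in for data the reader already holds
    std::size_t _pos = 0;
    key_range _range;
public:
    toy_reader(const std::vector<int>& keys, key_range r) : _keys(keys), _range(r) {
        skip_to_start();
    }
    // Returns the next key in the current range, or nullopt at end of stream.
    std::optional<int> next() {
        if (_pos < _keys.size() && _keys[_pos] < _range.second) {
            return _keys[_pos++];
        }
        return std::nullopt;
    }
    // Moves the same reader to a later range instead of recreating it.
    void fast_forward_to(key_range r) {
        assert(r.first >= _range.second && "ranges must move forward");
        _range = r;
        skip_to_start();
    }
private:
    void skip_to_start() {
        while (_pos < _keys.size() && _keys[_pos] < _range.first) {
            ++_pos;
        }
    }
};

// Reads several disjoint, increasing ranges with a single reader.
std::vector<int> read_ranges(const std::vector<int>& keys,
                             const std::vector<key_range>& ranges) {
    std::vector<int> out;
    if (ranges.empty()) {
        return out;
    }
    toy_reader reader(keys, ranges.front());
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (i > 0) {
            reader.fast_forward_to(ranges[i]);
        }
        while (auto k = reader.next()) {
            out.push_back(*k);
        }
    }
    return out;
}
```

The point of the sketch is that read_ranges() constructs one reader and
reuses its position across ranges, which is the property the cache relies
on when it has to fill several small gaps.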

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
c63e88d556 sstables: implement mutation_reader::impl::fast_forward_to()
This patch allows sstable readers to be fast forwarded without making it
necessary to recreate the reader (and drop all buffers in the
process). It is built on top of index_reader and the ability of
data_consume_context to be fast forwarded.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
a530762277 sstables: introduce index_reader
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they still may be needed (for example to
get end bound of the range or after fast forwarding the reader).

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f49a9e0d64 sstables: drop unused read_range_rows() overload
That overload was used only by unit test and violated guarantee that
partition range lives until mutation reader is done.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0bc873ace5 sstables: add fast_forward_to() to continuous_data_consumer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
25b91c51e2 sstables: add data_consume_rows_context::reset()
reset() is going to be used to restore valid state after fast forwarding
the reader.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2124d08b88 sstables: add skip() to compressed_file_data_source
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
54069162f5 Merge "Add test for partition version list consistency after compaction" from Tomek 2016-10-18 11:03:25 +01:00
Tomasz Grabiec
308434f891 tests: memtable: Add test for partition version list consistency after compaction 2016-10-18 11:57:14 +02:00
Tomasz Grabiec
6548132423 lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
ecf85cbffb mutation: Define + operation
It's more convenient to write m1 + m2 in tests than to do more
elaborate constructs with copy constructors and apply().
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
fe387f8ba0 partition_version: Fix corruption of partition_version list
The move constructor of partition_version was not invoking move
constructor of anchorless_list_base_hook. As a result, when
partition_version objects were moved, e.g. during LSA compaction, they
were unlinked from their lists.
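
The bug pattern can be shown in isolation. In the sketch below, list_hook,
broken_version and fixed_version are hypothetical stand-ins (not
anchorless_list_base_hook or partition_version): a user-written move
constructor that does not mention its base class default-constructs the
base, so the link held by the hook is lost.

```cpp
#include <utility>

// Minimal stand-in for an intrusive list hook: moving it transfers the link.
struct list_hook {
    bool linked = false;
    list_hook() = default;
    list_hook(list_hook&& o) noexcept : linked(std::exchange(o.linked, false)) {}
};

// Bug pattern: the base class is omitted from the initializer list, so it is
// default-constructed and the moved-to object ends up unlinked.
struct broken_version : list_hook {
    int data = 0;
    broken_version() = default;
    broken_version(broken_version&& o) noexcept : data(o.data) {}
};

// Fix: invoke the base move constructor explicitly.
struct fixed_version : list_hook {
    int data = 0;
    fixed_version() = default;
    fixed_version(fixed_version&& o) noexcept
        : list_hook(std::move(o)), data(o.data) {}
};
```

Moving a linked broken_version leaves the new object with linked == false,
while moving a fixed_version carries the link over to the new object.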

This can make readers return invalid data, because not all versions
will be reachable.

It also causes leaks of the versions which are not directly attached
to memtable entry. This will trigger assertion failure in LSA region
destructor. This assertion triggers with row cache disabled. With cache
enabled (default) all segments are merged into the cache region, which
currently is not destroyed on shutdown, so this problem would go
unnoticed. With cache disabled, memtable region is destroyed after
memtable is flushed and after all readers stop using that memtable.

Fixes #1753.
Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>
2016-10-18 09:25:38 +01:00
Duarte Nunes
1d45f19c78 create_view_statement: Use cf_properties
This patch uses cf_properties instead to add the missing attributes to
the create_view_statement class.

Fixes #1766

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c58b7e764 unimplemented: Add materialized views
This patch adds the VIEWS element to the cause enum so we can
mark failures due to incomplete support of materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c28ed3dfc schema: Extract default compressor
This patch extracts the definition of the default compressor into the
compression_parameters class, so that the table and view creation
statements don't have to explicitly deal with it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
dc470e6a36 cql3: Extract cf_properties
This patch extracts the cf_properties class, which contains common
attributes of tables and materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:51 +00:00
Takuya ASADA
587d375e19 main: exit with 1 when verify_seastar_io_scheduler() failed
Since we exit the Scylla process in engine().at_exit() using
::_exit(0), scylla always exits with 0, even when
verify_seastar_io_scheduler() throws an exception.

Because of this, systemd wrongly concludes that scylla-server.service was
shut down successfully, so we need to pass the correct exit code to ::_exit() here.

Fixes #1674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>
2016-10-17 13:57:00 +03:00
Avi Kivity
163088c6af Merge seastar upstream
* seastar 207bf3d...ccd8649 (3):
  > Merge "Augment semaphore with non-blocking operations" from Glauber
  > Merge "More dynamic fstream patches" from Paweł
  > Merge "fstream: add dynamic adjustments based on stream history" from Paweł
2016-10-17 12:49:17 +03:00
Avi Kivity
65c27ccf21 bytes_ostream: make max_chunk_size() an inline function
Fixes debug build looking for a variable definition and not finding it.
2016-10-17 11:49:33 +03:00
Avi Kivity
c0a1ad0b77 bytes_ostream: use larger allocations
A 1MB response will require 2000 allocations with the current 512-byte
chunk size.  Increase it exponentially to reduce allocation count for
larger responses (still respecting the upper limit).
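
To illustrate the arithmetic: 1 MiB split into 512-byte chunks is 2048
allocations, which the text above rounds to 2000. The sketch below counts
allocations for a fixed and for a growing chunk size. count_chunks() and its
parameters are hypothetical, the 16 KiB cap used in the note after the code
is an assumed value rather than the real upper limit, and per-chunk header
overhead is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Counts the chunk allocations needed to hold `total` bytes when each new
// chunk is `growth` times larger than the previous one, capped at `max_chunk`.
// growth == 1 models a fixed chunk size. All parameters are illustrative.
std::size_t count_chunks(std::size_t total, std::size_t first_chunk,
                         std::size_t max_chunk, std::size_t growth) {
    assert(first_chunk > 0 && growth >= 1 && max_chunk >= first_chunk);
    std::size_t chunks = 0;
    std::size_t chunk = first_chunk;
    std::size_t stored = 0;
    while (stored < total) {
        stored += chunk;
        ++chunks;
        chunk = std::min(chunk * growth, max_chunk);
    }
    return chunks;
}
```

Here count_chunks(1 << 20, 512, 512, 1) models the old fixed-size behaviour,
and count_chunks(1 << 20, 512, 16 * 1024, 2) models doubling up to the
assumed cap.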
Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>
2016-10-16 10:05:48 +01:00
Tomasz Grabiec
d836e8f64b tests: memtable: Add tests for flushing reader
Message-Id: <1476454187-11462-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:11:06 +01:00
Tomasz Grabiec
63784fd921 db: Fix corruption of partition_entry
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, it is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.

Change the accounting code to reuse existing partition_snapshot reference.

Fixes #1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:10:48 +01:00
Paweł Dziepak
d08cffd3c7 lsa: avoid exceptions during segment_zone creation
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.

This patch changes the LSA code so that it doesn't throw in case zone
couldn't be created but just returns a null pointer which should be
more performant if the LSA memory cannot grow any more.
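
A minimal sketch of the "return null instead of throwing" approach, under
stated assumptions: try_alloc_fn, zone and try_create_zone are hypothetical
names, and halving the request after each failure is an assumed retry
policy, not necessarily what the LSA code does.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical allocator hook that returns nullptr on failure, like malloc.
using try_alloc_fn = std::function<void*(std::size_t)>;

struct zone {
    void* base = nullptr;
    std::size_t size = 0;
};

// Starts with the largest request and halves it until the allocator succeeds
// or the request drops below `min_size`. Failure is reported as an empty
// zone rather than an exception, so expected failures stay cheap.
zone try_create_zone(const try_alloc_fn& try_alloc,
                     std::size_t max_size, std::size_t min_size) {
    assert(min_size > 0 && max_size >= min_size);
    for (std::size_t size = max_size; size >= min_size; size /= 2) {
        if (void* p = try_alloc(size)) {
            return zone{p, size};
        }
    }
    return zone{};
}
```

A caller checks zone::base for null and simply stops growing when no zone
could be created.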

Fixes #1394.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
2016-10-14 11:08:24 +02:00
Amnon Heiman
7829da13b4 scylla_setup: Reorder questions and actions
The expected behaviour in the scylla_setup script is that a question
will be followed by the answer.

For example, after asking whether scylla should be run as a service, the
relevant actions will be taken before the following question.

This patch addresses two such misorderings:
1. the scylla-housekeeping depends on the scylla-server, but the
setup should first setup the scylla-server service and only then ask
(and install if needed) the scylla-housekeeping.
2. The node_exporter should be placed after the io_setup is done.

Fixes #1739

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>
2016-10-13 18:29:36 +03:00
Pekka Enberg
3b4e6cdc5e abstract_replication_strategy: Fix exception type if class not found
Change abstract_replication_strategy::create_replication_strategy() to
throw exceptions::configuration_error if replication strategy class
lookup fails, to make sure the error is converted to the correct CQL response.

Fixes #1755

Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>
2016-10-13 17:39:28 +03:00
Tomasz Grabiec
e617bcd8a7 logalloc: disable abort on allocation failure in places in which it is benign
Some places start big expecting allocation failure, then reduce the
requested size. Let's not abort in such cases.

Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>
2016-10-13 10:53:32 +03:00
Avi Kivity
13e9d4c8e3 Merge seastar upstream
* seastar f937fb0...207bf3d (11):
  > Merge "iotune: gracefully exit on predictable exceptions" (Fixes #1623)
  > core/semaphore: Add semaphore_units::release()
  > Merge "rometheus API with grafana uses labels" from Amnon
  > core/thread: Fix stack alloc-dealloc mismatch
  > core/thread: Make jmp_buf_link::yield_at use the same time point as thread_scheduling_group
  > file: support for XFS on older kernels
  > reactor: fix bug when handling EBADF in flush_pending_aio()
  > prometheus CPU should start in 0
  > Collectd: bytes ordering depends on the type
  > tests: Check that backtrace() doesn't corrupt signal mask
  > core/thread: Add stack guards to seastar thread stacks
2016-10-12 23:47:12 +03:00
Avi Kivity
63f053e9b7 storage_proxy: fix mutation reordering with wrapping ranges
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.

Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.
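
The ordering problem and the fix can be sketched with integer tokens. The
names token_range and query_in_ring_order are hypothetical, the bounds
(exclusive start, inclusive end) are an assumption, and the real code merges
results from two queries rather than filtering a vector.

```cpp
#include <algorithm>
#include <vector>

// Toy range over integer tokens. When start > end the range wraps: it covers
// the tokens after `start`, followed by the tokens up to and including `end`.
struct token_range {
    int start;  // exclusive
    int end;    // inclusive
    bool wraps() const { return start > end; }
};

// Returns the matching tokens in ring order. Sorting everything by token
// would put the second half of a wrapping range before the first half.
std::vector<int> query_in_ring_order(std::vector<int> tokens, token_range r) {
    std::sort(tokens.begin(), tokens.end());
    std::vector<int> first_half, second_half;
    for (int t : tokens) {
        if (r.wraps()) {
            if (t > r.start) {
                first_half.push_back(t);
            } else if (t <= r.end) {
                second_half.push_back(t);
            }
        } else if (t > r.start && t <= r.end) {
            first_half.push_back(t);
        }
    }
    // The second half has lower priority: it is appended only after every
    // result of the first half.
    first_half.insert(first_half.end(), second_half.begin(), second_half.end());
    return first_half;
}
```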

Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.

Fixes #1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
2016-10-12 15:59:16 +02:00
Avi Kivity
1506b06617 Merge "node_exporter service on ubuntu 16" from Amnon
"This series address two issues that interfere with running the node_exporter as a service in ubuntu 16.
1. The service file should be packed in the deb file
2. When setting the node_exporter as a service it doesn't need to run as the scylla user"

* 'amnon/node_exporter_ubuntu_v2' of github.com:cloudius-systems/seastar-dev:
  node-exporter service: No need to run as scylla user
  debian package: Include the node_exporter service file
2016-10-12 12:11:18 +03:00
Amnon Heiman
1bd50789e0 node-exporter service: No need to run as scylla user
The node-exporter does not need to run as the scylla user. It can run
without scylla or without the scylla user being configured.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:27 +03:00
Amnon Heiman
d523bf56ed debian package: Include the node_exporter service file
This will include the node_exporter service script for ubuntu
distribution with systemd support.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:14 +03:00
Avi Kivity
f6998bb260 Merge "Implement describe_splits_ex based on Cassandra" from Duarte
"This patch-set re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693"

* 'describe-splits/v2' of github.com:duarten/scylla:
  thrift: Implement describe_splits_ex based on Cassandra
  storage_service: Implement get_splits() function
  sstables: Add function to get key samples
  sstables/key: Add to_partition_key function
  size_estimates_recorder: Increase estimate accuracy
  sstables: Get estimates for a particular range
  sstables/key: Make key::kind public
2016-10-11 11:13:35 +03:00
Takuya ASADA
0007f2d838 dist/common/sbin: add scylla_cpuset_setup and scylla_dev_mode_setup to /usr/sbin
We haven't added symlinks to /usr/sbin for newly created scripts, so add them.

Fixes #1702

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474879711-31793-1-git-send-email-syuu@scylladb.com>
2016-10-11 11:02:14 +03:00
Takuya ASADA
ccad720bb1 dist/common/script/scylla_io_setup: handle comma correctly when parsing cpuset
The script mistakenly split the value at "," when the cpuset list is separated
by commas. Instead of matching possible patterns of the argument, let's
pass all characters until we reach a space delimiter or the end of the line.

Fixes #1716

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476171037-32373-1-git-send-email-syuu@scylladb.com>
2016-10-11 10:42:32 +03:00
Duarte Nunes
d8cfc56376 thrift: Implement describe_splits_ex based on Cassandra
This patch re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:10 +02:00
Duarte Nunes
01ab2081cd storage_service: Implement get_splits() function
This patch implements the get_splits() function in storage_service,
used to split a particular token range in slices of approximately the
specified size, using the sample keys and estimates of the CF's
sstables.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:08 +02:00
Duarte Nunes
c36dbaf0f1 sstables: Add function to get key samples
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:14 +02:00
Duarte Nunes
fc07b66678 sstables/key: Add to_partition_key function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:11 +02:00
Duarte Nunes
c19c633299 size_estimates_recorder: Increase estimate accuracy
This patch uses the estimated_keys_for_range() function to get better
estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:16 +02:00
Duarte Nunes
ceed09b23e sstables: Get estimates for a particular range
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present between the specified range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:15 +02:00
Duarte Nunes
8c223b31c8 sstables/key: Make key::kind public
Needed to create synthetic keys without any value but with ordering
properties.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:47:24 +02:00
Avi Kivity
b305d92a65 Merge "housekeeping: check version during setup" from Amnon
"The version is taken from the installation rather than the API, a mode command
line indicated that this is part of the setup and uuid is used for the
interaction with the checkversion server."

* 'amnon/check_version_on_startup_v3' of github.com:cloudius-systems/seastar-dev:
  scylla_setup: Check and report the scylla version
  scylla-housekeeping: check version during setup
2016-10-10 16:37:14 +03:00
Vlad Zolotarov
ab748e829d docs: tracing.md: initial commit
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475686745-20383-1-git-send-email-vladz@cloudius-systems.com>
2016-10-10 16:12:02 +03:00
Tomasz Grabiec
4357d0a6d9 db: Add counter for writes blocked on dirty memory
There is already queue_length-requests_blocked_memory, but it's a
gauge so does not reflect what happened between the sampling points.

total_operations-requests_blocked_memory will make it possible to see
whether there were any (and how many) requests which were blocked by dirty memory.

Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
2016-10-10 14:25:22 +03:00
Pekka Enberg
3b75ff1496 docs/docker: Tag --listen-address as 1.4 feature
The Docker Hub documentation is the same for all image versions. Tag
`--listen-address` as 1.4 feature.

Message-Id: <1475819164-7865-1-git-send-email-penberg@scylladb.com>
2016-10-10 13:26:16 +03:00
Vlad Zolotarov
006999f46c api::storage_service::slow_query: don't use duration_cast in GET
The slow_query_record_ttl() and slow_query_threshold() return the duration
of the appropriate type already - no need for an additional cast.
In addition there was a mistake in a cast of ttl.

Fixes #1734

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475669400-5925-1-git-send-email-vladz@cloudius-systems.com>
2016-10-09 18:09:13 +03:00
Takuya ASADA
469e9af1f4 dist/common/scripts/scylla_setup: use 'swapon -s' instead of 'swapon --show'
Since Ubuntu 14.04 doesn't support the --show option, we need to avoid using it.
Fixes #1740

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-2-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Takuya ASADA
8452045b85 dist/ubuntu: add realpath to dependency, requires for scylla_setup
We need a dependency on realpath, since scylla_setup uses it.

Fixes #1740.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-1-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Tomasz Grabiec
41e66ebce2 gdb: Introduce 'scylla heapprof'
Presents current heap profile recording.

Works in text mode or dumps to collapsed stacks format from which
flame graph can be generated.

To generate a flamegraph:

  (gdb) scylla heapprof --flame
  Wrote heapprof.stacks

  $ flamegraph.pl --colors mem < heapprof.stacks > heapprof.svg

flamegraph.pl comes from:

  https://github.com/brendangregg/FlameGraph.git

Text mode example:

(gdb) scylla heapprof --min 100000000
All (274699676, #10213)
 \-- void* memory::cpu_pages::allocate_large_and_trim<memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}>(unsigned int, memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}) + 169  (268435456, #1)
     memory::allocate_large_aligned(unsigned long, unsigned long) + 87
     memory::allocate_aligned(unsigned long, unsigned long) + 48
     aligned_alloc + 9
     logalloc::segment_zone::segment_zone() + 304
     logalloc::segment_pool::allocate_segment() + 477
     logalloc::segment_pool::segment_pool() + 304
     __tls_init.part.801 + 72
     logalloc::region_group::release_requests() + 1333
     logalloc::region_group::add(logalloc::region_group*) + 514

The branches are formatted like this:

   -- <symbol> (<size>, #<count>)

Where <size> is total size of live objects and <count> is total
number of live objects, for all objects allocated from paths going
through this node.

Nodes which share the same <size> and <count> are stacked like this:

   -- <symbol_1> (<size>, #<count>)
      <symbol_2>
      <symbol_3>

Message-Id: <1475583334-19524-1-git-send-email-tgrabiec@scylladb.com>
2016-10-09 10:54:08 +03:00
Glauber Costa
33e9c2bbdd memtable: reduce sstable flush concurrency to one
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write behind support. Now
that write behind is properly merged we can reduce the concurrency to
what it should be, one.

This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.

Fixes #1373

Signed-off-by: Glauber Costa <glauber@scylladb.com>

Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
2016-10-09 10:48:57 +03:00
Tomasz Grabiec
2a5a90f391 db: Do not timeout streaming readers
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.

This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.

The change sets no timeout for streaming readers at replica level,
as we already do for system table readers.

Fixes #1741.

Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
2016-10-07 15:41:04 +03:00
Raphael S. Carvalho
9175977a9d cql3: fix build failure by defining out unused function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <cba6207278ea945ee750d78b189320443843a288.1475793747.git.raphaelsc@scylladb.com>
2016-10-07 08:45:18 +03:00
Avi Kivity
9ac441d3b5 range: adjust split_after to allow split_point outside input range
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range.  If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.
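
The three cases can be sketched on a toy integer range. int_range and this
free-standing split_after() are hypothetical (the real method lives on the
range class), std::optional stands in for stdx::optional, and the bound
inclusiveness is an assumption made to keep the example small.

```cpp
#include <cassert>
#include <optional>

// Toy non-wrapping range (start, end] over integers: start is exclusive and
// end is inclusive. Not Scylla's range<T>.
struct int_range {
    int start;
    int end;
};

// Returns the part of `r` that lies strictly after `split_point`.
std::optional<int_range> split_after(const int_range& r, int split_point) {
    assert(r.start < r.end);
    if (split_point <= r.start) {
        return r;                  // split point is before the range
    }
    if (split_point >= r.end) {
        return std::nullopt;       // nothing is left after the split point
    }
    return int_range{split_point, r.end};
}
```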

"before" and "after" are not well defined for wrap-around ranges, so
but we are phasing them out and soon there will not be
wrapping_range::split_after() users.

This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
2016-10-06 17:54:44 +02:00
Raphael S. Carvalho
7ea4513595 database: trigger compaction after loading new sstables
Scylla wasn't trying to compact new sstables uploaded via 'nodetool
refresh'. Thus, all new sstables were left uncompacted until user
issued 'nodetool flush' or a new sstable was written which would
trigger compaction too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:49 +03:00
Raphael S. Carvalho
9c59ccc52a storage_service: improve log message for refresh
'No new SSTables were found for keyspace1.standard1' was printed
if user uploaded new sstables to upload dir instead, and that is
confusing. We should instead print that if new sstables weren't
found in both cf and cf/upload dirs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <90386f6255407697434213227ae7ff0de7464f99.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:32 +03:00
Raphael S. Carvalho
76862d0d9c main: start compaction procedure after commit log is replayed
Commit log replay is a synchronous operation in bootstrap, so services
will only be started after it's completed. By starting compaction before,
less bandwidth will be available to both and consequently boot will be
slowed down. The fix is simply to move compaction, which is an
asynchronous operation, to after commitlog replay is over.

Fixes #1620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>
2016-10-06 18:25:24 +03:00
Nadav Har'El
ee7ec10b11 CQL parser: "CREATE MATERIALIZED VIEW" statement
This patch adds the parsing for the "CREATE MATERIALIZED VIEW" statement,
following Cassandra 3 syntax. For example:

   CREATE MATERIALIZED VIEW building_by_city
   AS SELECT * FROM buildings
   WHERE city IS NOT NULL
   PRIMARY KEY(city, name);

It also adds the "IS NOT NULL" operator needed for this purpose.
As in Cassandra, "IS NOT NULL" can only be used for materialized
view creation, and not in a normal SELECT. It can only be used with
the NULL operand (i.e., "IS NOT 3" will be a syntax error).

The current implementation of this statement just does some sanity
checking (such as verifying that "city" is a valid column name and that
the "buildings" base table exists), and then complains that materialized
views are not yet supported:

SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [CREATE MATERIALIZED VIEW building_by_city AS
SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);] reason: unsupported operation: Materialized views not yet supported">

As mentioned above, the "IS NOT NULL" restriction is not allowed in
ordinary selects that do not create a materialized view:

SELECT * FROM buildings WHERE city IS NOT NULL;
InvalidRequest: code=2200 [Invalid query] message="restriction 'city IS NOT null' is only supported in materialized view creation"

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1475742927-30695-1-git-send-email-nyh@scylladb.com>
2016-10-06 15:42:37 +03:00
Glauber Costa
7146776d7c fix sstable tests by not using the flush_reader if no region_group
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to them.

Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.

If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.

One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.

The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
2016-10-05 12:44:21 +01:00
Avi Kivity
c94fb1bf12 build: reduce inclusions of messaging_service.hh
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.

Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
2016-10-05 11:46:49 +03:00
Avi Kivity
f8118d9fc2 Merge "Virtual dirty memory management" from Glauber
"Description:
============

Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Results
=======

With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.

In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.

For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ

Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed.  I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have run (large, 65k mutations). Memtables under this circumstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.

Known Issues:
-------------

A lot of time can elapse between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.

After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."

* 'virtual-dirty-v6' of github.com:glommer/scylla:
  database: allow virtual dirty memory management
  streamed_mutation: make _buffer private
  add accounting of memory read to partition_snapshot_reader
  move partition_snapshot_reader code to header file
  LSA: allow a group to query its own region group
  memtables: split scanning reader in two
  sstables: use special reader for writing a memtable
  LSA: export information about object memory footprint
  LSA: export information about size of the throttle queue
  database: export virtual dirty bytes region group
2016-10-04 20:57:52 +03:00
Avi Kivity
cc33c8b4ba Merge seastar upstream
* seastar 18f7bb8...f937fb0 (5):
  > Merge "Fix signal mask corruption" from Tomasz
  > core/memory: Avoid violating strict aliasing when accessing allocation sites
  > core/memory: Avoid indirection when storing allocation sites
  > core/memory: Add a way to disable abort on allocation failure in some scope
  > core/sharded: Allow mapper to take the service by non-const reference
2016-10-04 20:08:57 +03:00
Glauber Costa
f89a67c75c database: allow virtual dirty memory management
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
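
The accounting described above can be sketched as a small model. This is only an illustration under stated assumptions: the class name `dirty_memory_model` and all of its members are hypothetical and are not taken from the Scylla source, and the real implementation works on LSA region groups rather than plain counters.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of "virtual dirty" accounting (names are illustrative).
class dirty_memory_model {
    std::size_t _limit;           // total dirty memory budget, in bytes
    std::size_t _real_dirty = 0;  // bytes held by memtables not yet released
    std::size_t _flushed = 0;     // bytes already written by in-progress flushes
public:
    explicit dirty_memory_model(std::size_t limit) : _limit(limit) {}

    // Virtual dirty: real dirty memory minus what flushes have written back.
    std::size_t virtual_dirty() const { return _real_dirty - _flushed; }

    // Requests are admitted only while virtual dirty is below 50 % of the limit.
    bool can_admit() const { return virtual_dirty() < _limit / 2; }

    // A write adds to the real dirty memory.
    void on_write(std::size_t bytes) { _real_dirty += bytes; }

    // Called as the flush reader writes memory back: this alone is enough
    // to let new requests in, before the SSTable is sealed.
    void on_flush_progress(std::size_t bytes) {
        assert(_flushed + bytes <= _real_dirty);
        _flushed += bytes;
    }

    // Called once the SSTable is sealed and the memory is really released.
    void on_flush_done(std::size_t bytes) {
        assert(bytes <= _flushed);
        _real_dirty -= bytes;
        _flushed -= bytes;
    }
};
```

In this model, sealing the SSTable does not change `virtual_dirty()`: the bytes were already counted as free when they were written, which is what lets admission track disk speed.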

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
7b6e8a2526 streamed_mutation: make _buffer private
It is currently protected, but now all users go through
push_mutation_fragment().  So we can safely move its visibility to guarantee
that it stays that way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
1db245b52d add accounting of memory read to partition_snapshot_reader
By default, we don't do any accounting. By specializing this class and providing
an accounter class, we can account for how much memory we are reading as we read
through the elements.
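
A minimal sketch of the pattern, assuming a policy-style template parameter with a no-op default. `sizes_reader`, `no_accounter` and `counting_accounter` are hypothetical stand-ins; the real partition_snapshot_reader reads mutation fragments, not a vector of sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Default policy: no accounting at all.
struct no_accounter {
    void operator()(std::size_t) {}
};

// Policy that sums up the bytes seen by the reader.
struct counting_accounter {
    std::size_t total = 0;
    void operator()(std::size_t bytes) { total += bytes; }
};

// Hypothetical stand-in for a reader: each element is just its size in bytes.
template <typename Accounter = no_accounter>
class sizes_reader {
    std::vector<std::size_t> _sizes;
    std::size_t _pos = 0;
    Accounter _accounter;
public:
    explicit sizes_reader(std::vector<std::size_t> sizes, Accounter acc = Accounter())
        : _sizes(std::move(sizes)), _accounter(std::move(acc)) {}

    // Reads the next element and reports its size to the accounter.
    // Returns false once the reader is exhausted.
    bool next() {
        if (_pos == _sizes.size()) {
            return false;
        }
        _accounter(_sizes[_pos++]);
        return true;
    }

    const Accounter& accounter() const { return _accounter; }
};
```

With the default `no_accounter`, the accounting call is an empty inline function, so readers that do not need accounting should pay nothing for it.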

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
452eb95943 move partition_snapshot_reader code to header file
This is so we can template it without worrying about declaring the
specializations in the .cc file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
86aa0b830d LSA: allow a group to query its own region group
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
eee15578fb memtables: split scanning reader in two
The code that is common will live in its own reader, the iterator_reader.  All
friendly private access to memtable attributes and methods happen through the
iterator reader.

After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
16886eeb96 sstables: use special reader for writing a memtable
Right now the special reader doesn't do much, but the idea is that we will
soon replace it with a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
28e3f2f6ee LSA: export information about object memory footprint
We allocate objects of a certain size, but we use a bit more memory to hold
them.  To get a clearer picture of how much memory an object will cost us, we
need help from the allocator. This patch exports an interface that allows users
to query into a specific allocator to get that information.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Pekka Enberg
c3bebea1ef dist/docker: Add '--listen-address' to 'docker run'
Add a '--listen-address' command line parameter to the Docker image,
which can be used to set Scylla's listen address.

Refs #1723

Message-Id: <1475485165-6772-1-git-send-email-penberg@scylladb.com>
2016-10-04 13:57:55 +03:00
Marius
876775a52c dist/docker/ubuntu: refactored $IP/listen_address
In order to allow Scylla’s docker container to handle multiple network
interfaces, the start-scylla script was refactored:

- `$IP` is now called `$SCYLLA_LISTEN_ADDRESS`, so it is less likely to
   be confused or interfere with other environment variables.
- `$SCYLLA_LISTEN_ADDRESS` now checks its value and also tries to
   resolve a hostname, if no IP was set to it.
- `$SCYLLA_LISTEN_DEVICE` can now be set as environment variable and
   contain any available NIC device name (e.g. `eth0`). The script
   automatically retrieves the IP address from the device.

Usage:

1. With `$SCYLLA_LISTEN_ADDRESS` as IP:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=192.168.1.100 scylladb/scylla`

2. With `$SCYLLA_LISTEN_ADDRESS` as hostname:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=containername.network.lan scylladb/scylla`

3. With `$SCYLLA_LISTEN_DEVICE`:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_DEVICE=eth0 scylladb/scylla`

Message-Id: <20161003151230.67672-1-marius@twostairs.com>
2016-10-04 13:56:55 +03:00
Raphael S. Carvalho
747b42299c database: remove unused code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>
2016-10-04 09:26:43 +03:00
Paweł Dziepak
7599ef6fde query_pager: fix splitting range at the end bound
Currently, the code responsible for calculating ranges for the next
request could produce a wrap-around partition range. For example, if the
original range was (unimportant, A] and the last partition key was A, then
the output range would be (A, A].

This patch adds checks to make sure that in such cases the range is
removed.
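
The check can be sketched as follows. This is a simplified illustration, not the pager code: keys are plain integers, and `bound`, `key_range` and `remaining_after` are hypothetical names.

```cpp
#include <cassert>
#include <optional>

// Simplified range over integer keys (a stand-in for partition keys).
struct bound {
    int value;
    bool inclusive;
};

struct key_range {
    bound start;
    bound end;
};

// Computes what is left of `r` after the page ended at `last_key`.
// Returns std::nullopt when nothing is left, instead of producing
// a range such as (A, A].
std::optional<key_range> remaining_after(const key_range& r, int last_key) {
    assert(last_key <= r.end.value);  // the last key read lies within the range
    if (last_key == r.end.value) {
        // The next start bound would be (A, and the end bound is A] or A):
        // the range is empty, so drop it.
        return std::nullopt;
    }
    // Continue strictly after the last key that was returned.
    return key_range{bound{last_key, false}, r.end};
}
```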

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475497244-2790-1-git-send-email-pdziepak@scylladb.com>
2016-10-03 19:33:42 +02:00
Avi Kivity
8747054d10 exceptions: mark function called before construction static
cassandra_exception::prepare_message() is called from derived classes'
constructors before the base cassandra_exception object is constructed.
This is technically illegal but harmless.  Fix by marking the function
static.
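
A minimal sketch of the pattern, using hypothetical classes `base_error` and `derived_error` rather than the real exception hierarchy. Calling a non-static member function in a mem-initializer before the base class is constructed is undefined behavior; a static function uses no object state, so the call is well-defined.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

class base_error : public std::runtime_error {
public:
    explicit base_error(std::string msg) : std::runtime_error(std::move(msg)) {}
protected:
    // Static: does not touch `this`, so it is safe to call while the
    // base sub-object has not been constructed yet.
    static std::string prepare_message(const std::string& what, int code) {
        return what + " (code " + std::to_string(code) + ")";
    }
};

class derived_error : public base_error {
public:
    // prepare_message() runs before base_error's constructor does.
    derived_error(const std::string& what, int code)
        : base_error(prepare_message(what, code)) {}
};
```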

Found by clang.
2016-10-03 16:29:02 +03:00
Calle Wilund
5b815b81b4 auth::password_authenticator: Ensure exceptions are processed in continuation
Fixes #1718 (even more)
Message-Id: <1475497389-27016-1-git-send-email-calle@scylladb.com>
2016-10-03 14:49:59 +02:00
Pekka Enberg
f3cd21c8f1 Merge seastar upstream
* seastar 0e60722...18f7bb8 (1):
  > core/memory: Fix compilation errors
2016-10-03 12:54:38 +03:00
Calle Wilund
d24d0f8f90 auth::password_authenticator: "authenticate" should not throw undeclared excpt
Fixes #1718

Message-Id: <1475487331-25927-1-git-send-email-calle@scylladb.com>
2016-10-03 12:53:30 +03:00
Avi Kivity
a51804eca8 Merge "token_restriction: Deal with minimum tokens" from Duarte
"This patch set ensures we can correctly handle queries
where the minimum token is specified."

* 'min-token/v3' of github.com:duarten/scylla:
  cql_query_test: Add test case for min/max token bounds
  token_restriction: Deal with minimum tokens
  partitioner: Parse token from bytes
2016-10-02 12:32:40 +03:00
Avi Kivity
5071f4c0bf Merge seastar upstream
* seastar 9e1d5db...0e60722 (9):
  > core/memory: Replace assert with bad_alloc in allocate_large()
  > chunked_fifo: avoid direct use of sized operator delete
  > memory: fix build without heap profiler
  > xen: initialize port::_sem
  > Merge "Make input streams skippable" from Paweł
  > semaphore: require explict setting for start value
  > prometheus: remove invalid chars from meric names
  > core/memory: Introduce heap profiler
  > util/backtrace: Mark noexcept if func() doesn't throw
2016-10-02 11:43:22 +03:00
Vlad Zolotarov
7e180c7bd3 tracing: introduce the tracing::global_trace_state_ptr class
This object, similarly to a global_schema_ptr, allows dynamically
creating trace_state_ptr objects on different shards in the context
of the original tracing session.

This object would create a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.

Fixes #1678
Fixes #1647

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
2016-10-02 11:31:37 +03:00
Amnon Heiman
a83bd900be scylla_setup: Check and report the scylla version
This patch adds a call to the scylla-housekeeping check version during
setup, so a warning will be printed if a newer version is available.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Amnon Heiman
5e3ab32365 scylla-housekeeping: check version during setup
These changes are for running scylla during setup.

It contains the following changes:
1. Get the current version from the command line (as scylla is not
running at this stage).
2. Support a mode parameter in the command line to indicate that we are
running during the installation.
3. Accept an external uuid that will be used with all interactions
with the check_version server.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Takuya ASADA
15b156c9d4 dist/common/scripts/scylla_io_setup: describe how to set developer mode when validation tests failed
Describe how to set developer mode, so as not to confuse users.
Fixes #1701

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475167584-18092-1-git-send-email-syuu@scylladb.com>
2016-10-02 10:58:38 +03:00
Avi Kivity
58ddfea18f Merge "Fixes for leveled compaction strategy" from Raphael
* 'lcs_fixes' of github.com:raphaelsc/scylla:
  lcs: fix starvation at higher levels
  lcs: fix broken token range distribution at higher levels
2016-10-01 21:34:21 +03:00
Takuya ASADA
9639cc840e dist/redhat: add missing build time dependency for libunwind
The build time dependency on libunwind was missing, so add it.
Fixes #1722

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475260099-25881-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:39 +03:00
Takuya ASADA
c89d9599b1 dist/ubuntu: add missing build time dependency for libunwind
The build time dependency on libunwind was missing, so add it.
Fixes #1721

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475255706-26434-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:21 +03:00
Raphael S. Carvalho
a8ab4b8f37 lcs: fix starvation at higher levels
When max sstable size is increased, higher levels are suffering from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)
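
The calculation can be sketched as below. The 10x growth per level in `max_size_for_level` is an assumption borrowed from Cassandra's leveled compaction and may not match the code exactly; L0 is not modelled, and all function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: each level may hold 10x the data of the previous one,
// starting from max_sstable_size at level 0 of this simplified model.
inline std::uint64_t max_size_for_level(unsigned level, std::uint64_t max_sstable_size) {
    std::uint64_t size = max_sstable_size;
    for (unsigned i = 0; i < level; ++i) {
        size *= 10;
    }
    return size;
}

// Score of a level: how full it is relative to its maximum size.
inline double level_score(std::uint64_t level_size, unsigned level,
                          std::uint64_t max_sstable_size) {
    return static_cast<double>(level_size) /
           static_cast<double>(max_size_for_level(level, max_sstable_size));
}

// A level is picked for compaction only when its score exceeds 1.001.
inline bool needs_compaction(std::uint64_t level_size, unsigned level,
                             std::uint64_t max_sstable_size) {
    return level_score(level_size, level, max_sstable_size) > 1.001;
}
```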

Fixes #1720.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:49 -03:00
Raphael S. Carvalho
a3bf7558f2 lcs: fix broken token range distribution at higher levels
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing the sstable with the lowest first key when
compacting a level > 0. This resulted in a performance problem because, for
example, L1->L2 may have a huge overlap over time.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.
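
The "round robin" selection can be sketched as follows. This is an illustration only: keys are plain integers, the level is represented by the sorted first keys of its sstables, and `pick_next` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// `first_keys` holds the first key of each sstable in a level, sorted.
// Returns the index of the sstable to compact next: the first one that
// starts after the last compacted key, wrapping around at the end.
std::size_t pick_next(const std::vector<int>& first_keys,
                      const std::optional<int>& last_compacted_key) {
    assert(!first_keys.empty());
    assert(std::is_sorted(first_keys.begin(), first_keys.end()));
    if (!last_compacted_key) {
        return 0;  // nothing compacted yet in this level: start from the lowest key
    }
    auto it = std::upper_bound(first_keys.begin(), first_keys.end(), *last_compacted_key);
    if (it == first_keys.end()) {
        return 0;  // went past the last sstable: wrap around
    }
    return static_cast<std::size_t>(it - first_keys.begin());
}
```

Always returning index 0 corresponds to the old behavior, where only the sstable with the lowest first key was ever chosen.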

Fixes #1719.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:16 -03:00
Paweł Dziepak
eb1fcf3ecc query_pagers: fix clustering key range calculation
Paging code assumes that clustering row range [a, a] contains only one
row which may not be true. Another problem is that it tries to use
range<> interface for dealing with clustering key ranges which doesn't
work because of the lack of a correct comparator.

Refs #1446.
Fixes #1684.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
2016-09-30 17:32:59 +02:00
Tomasz Grabiec
7e25b958ac transport: Extend request memory footprint accounting to also cover execution
CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for
request's memory only around reading of its frame from the connection
and not actual request execution. As a result too many requests may be
allowed to execute and we may run out of memory.
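
The idea can be sketched synchronously as below, assuming an RAII holder for the reserved memory. `memory_budget`, `reservation` and `process_request` are hypothetical names; the real server is asynchronous and would wait for memory rather than reject the request.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical budget of memory available to in-flight requests.
class memory_budget {
    std::size_t _available;
public:
    explicit memory_budget(std::size_t total) : _available(total) {}
    std::size_t available() const { return _available; }
    bool try_reserve(std::size_t n) {
        if (n > _available) {
            return false;
        }
        _available -= n;
        return true;
    }
    void release(std::size_t n) { _available += n; }
};

// RAII holder: gives the reserved bytes back when destroyed.
class reservation {
    memory_budget* _budget;
    std::size_t _n;
public:
    reservation(memory_budget& b, std::size_t n) : _budget(&b), _n(n) {}
    reservation(reservation&& o) noexcept
        : _budget(std::exchange(o._budget, nullptr)), _n(o._n) {}
    reservation(const reservation&) = delete;
    reservation& operator=(const reservation&) = delete;
    ~reservation() {
        if (_budget) {
            _budget->release(_n);
        }
    }
};

// The reservation covers both reading the frame and executing the request.
template <typename Execute>
bool process_request(memory_budget& budget, std::size_t frame_size, Execute&& execute) {
    if (!budget.try_reserve(frame_size)) {
        return false;  // in this sketch the request is rejected, not queued
    }
    reservation held(budget, frame_size);
    // ... the frame would be read from the connection here ...
    execute();  // memory is still accounted for while the request executes
    return true;
}  // `held` is destroyed here, after execution has finished
```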

Fixes #1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
2016-09-30 14:23:14 +01:00
Duarte Nunes
72af476397 cql_query_test: Add test case for min/max token bounds
This patch adds a test case for specifying the minimum and maximum
tokens in a cql3 query.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:45:45 +00:00
Duarte Nunes
98b4814894 token_restriction: Deal with minimum tokens
This patch fixes a bug where queries such as the following are not
handled properly:

"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"

Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.
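
A minimal sketch of the special case, assuming a token type whose kind distinguishes the minimum token from ordinary key tokens. The `token_kind` enumerators and `token_from_long` are illustrative names, not confirmed against the partitioner code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumed shape of a token: ordinary tokens carry a value, while the
// minimum token is represented by its kind alone.
enum class token_kind { before_all_keys, key, after_all_keys };

struct token {
    token_kind kind;
    std::int64_t value;
};

// Translates a token literal from a query into a token. The smallest
// 64-bit value denotes the minimum token and must not become a key token.
inline token token_from_long(std::int64_t v) {
    if (v == std::numeric_limits<std::int64_t>::min()) {
        return token{token_kind::before_all_keys, 0};
    }
    return token{token_kind::key, v};
}
```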

Ref #1139
Ref #693
Fixes #1717

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:08 +00:00
Duarte Nunes
862f51cddf partitioner: Parse token from bytes
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is to parse a particular token and explicitly handle the
case when the minimum token is specified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:02 +00:00
Duarte Nunes
0c8f280af7 partition_key_view: Implement operator<<
The operator is declared, but it isn't implemented. This patch fixes
that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475225647-3800-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:54 +02:00
Duarte Nunes
a36888f3cb storage_service: Convert token through partitioner
This patch ensures we use the partitioner to convert a token to
sstring instead of casting.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475179683-28552-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:26 +02:00
Glauber Costa
f5fd6bd714 LSA: export information about size of the throttle queue
Also add information about how long the oldest entry has been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.

This counter is named for consistency after similar counters from transport/.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
Glauber Costa
aa6a96d09b database: export virtual dirty bytes region group
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".

We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
546 changed files with 29317 additions and 8958 deletions

11
CONTRIBUTING.md Normal file

@@ -0,0 +1,11 @@
# Asking questions or requesting help
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.
# Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
# Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.


@@ -83,14 +83,6 @@ Run the image with:
docker run -p $(hostname -i):9042:9042 -i -t <image name>
```
## Contributing to Scylla
Do not send pull requests.
Send patches to the mailing list address scylladb-dev@googlegroups.com.
Be sure to subscribe.
In order for your patches to be merged, you must sign the Contributor's
License Agreement, protecting your rights and ours. See
http://www.scylladb.com/opensource/cla/.
[Guidelines for contributing](CONTRIBUTING.md)


@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=1.4.3
VERSION=1.7.5
if test -f version
then
@@ -10,7 +10,12 @@ else
DATE=$(date +%Y%m%d)
GIT_COMMIT=$(git log --pretty=format:'%h' -n 1)
SCYLLA_VERSION=$VERSION
SCYLLA_RELEASE=$DATE.$GIT_COMMIT
# For custom package builds, replace "0" with "counter.your_name",
# where counter starts at 1 and increments for successive versions.
# This ensures that the package manager will select your custom
# package over the standard release.
SCYLLA_BUILD=0
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
fi
echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"


@@ -397,6 +397,36 @@
}
]
},
{
"path": "/cache_service/metrics/key/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/size",
"operations": [
@@ -607,6 +637,36 @@
}
]
},
{
"path": "/cache_service/metrics/counter/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/size",
"operations": [


@@ -78,11 +78,19 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",
"required":false,
"allowMultiple":false,
"type":"bool",
"paramType":"query"
}
]
}
@@ -102,7 +110,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -129,7 +137,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -153,7 +161,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -180,7 +188,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -204,7 +212,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -244,7 +252,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -271,7 +279,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -298,7 +306,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -317,7 +325,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -349,7 +357,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -381,7 +389,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -405,7 +413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -432,7 +440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -459,7 +467,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -491,7 +499,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -518,7 +526,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -545,7 +553,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -569,7 +577,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -593,7 +601,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -633,7 +641,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -673,7 +681,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -713,7 +721,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -753,7 +761,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -793,7 +801,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -833,7 +841,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -873,7 +881,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -916,7 +924,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -943,7 +951,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -970,7 +978,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -994,7 +1002,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1034,7 +1042,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1058,7 +1066,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1101,7 +1109,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1144,7 +1152,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1203,7 +1211,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1243,7 +1251,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1267,7 +1275,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1310,7 +1318,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1353,7 +1361,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1412,7 +1420,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1452,7 +1460,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1492,7 +1500,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1532,7 +1540,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1572,7 +1580,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1612,7 +1620,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1652,7 +1660,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1692,7 +1700,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1732,7 +1740,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1772,7 +1780,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1812,7 +1820,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1852,7 +1860,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1892,7 +1900,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1932,7 +1940,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1972,7 +1980,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2012,7 +2020,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2052,7 +2060,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2092,7 +2100,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2116,7 +2124,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2156,7 +2164,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2196,7 +2204,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2236,7 +2244,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2276,7 +2284,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2300,7 +2308,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2324,7 +2332,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2351,7 +2359,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2378,7 +2386,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2405,7 +2413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2432,7 +2440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2501,7 +2509,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2525,7 +2533,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2549,7 +2557,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2573,7 +2581,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2597,7 +2605,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2621,7 +2629,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2645,7 +2653,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2669,7 +2677,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2693,7 +2701,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2717,7 +2725,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2741,7 +2749,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2765,7 +2773,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",

@@ -21,8 +21,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
@@ -45,8 +45,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"

@@ -42,6 +42,25 @@
}
]
},
{
"path":"/failure_detector/endpoint_phi_values",
"operations":[
{
"method":"GET",
"summary":"Get end point phi values",
"type":"array",
"items":{
"type":"endpoint_phi_values"
},
"nickname":"get_endpoint_phi_values",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/failure_detector/endpoints/",
"operations":[
@@ -202,6 +221,20 @@
"description": "The application state version"
}
}
},
"endpoint_phi_value": {
"id" : "endpoint_phi_value",
"description": "Holds phi value for a single end point",
"properties": {
"phi": {
"type": "double",
"description": "Phi value"
},
"endpoint": {
"type": "string",
"description": "end point address"
}
}
}
}
}

@@ -1201,11 +1201,12 @@
],
"parameters":[
{
"name":"non_system",
"description":"When set to true limit to non system",
"name":"type",
"description":"Which keyspaces to return",
"required":false,
"allowMultiple":false,
"type":"boolean",
"type":"string",
"enum": [ "all", "user", "non_local_strategy" ],
"paramType":"query"
}
]

@@ -166,33 +166,36 @@ inline int64_t max_int64(int64_t a, int64_t b) {
* It combines the total and the subset for the ratio, and its
* to_json method returns the ratio sub/total
*/
struct ratio_holder : public json::jsonable {
double total = 0;
double sub = 0;
template<typename T>
struct basic_ratio_holder : public json::jsonable {
T total = 0;
T sub = 0;
virtual std::string to_json() const {
if (total == 0) {
return "0";
}
return std::to_string(sub/total);
}
ratio_holder() = default;
ratio_holder& add(double _total, double _sub) {
basic_ratio_holder() = default;
basic_ratio_holder& add(T _total, T _sub) {
total += _total;
sub += _sub;
return *this;
}
ratio_holder(double _total, double _sub) {
basic_ratio_holder(T _total, T _sub) {
total = _total;
sub = _sub;
}
ratio_holder& operator+=(const ratio_holder& a) {
basic_ratio_holder<T>& operator+=(const basic_ratio_holder<T>& a) {
return add(a.total, a.sub);
}
friend ratio_holder operator+(ratio_holder a, const ratio_holder& b) {
friend basic_ratio_holder<T> operator+(basic_ratio_holder a, const basic_ratio_holder<T>& b) {
return a += b;
}
};
typedef basic_ratio_holder<double> ratio_holder;
typedef basic_ratio_holder<int64_t> integral_ratio_holder;
class unimplemented_exception : public base_exception {
public:
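
The effect of templating the holder on its value type can be seen in isolation. The following is a minimal sketch, not the code from this change: `ratio_sketch` is a hypothetical stand-in that drops the `json::jsonable` base and keeps only `add` and `to_json`. It assumes the same division in `to_json`, so an integral `T` truncates the mean while `double` keeps the fraction.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for basic_ratio_holder<T>.
// The json::jsonable base class is omitted to keep the sketch self-contained.
template <typename T>
struct ratio_sketch {
    T total = 0;
    T sub = 0;

    ratio_sketch& add(T t, T s) {
        total += t;
        sub += s;
        return *this;
    }

    // Same shape as the to_json in the diff: guard against division by zero,
    // then format sub/total. With an integral T the division truncates.
    std::string to_json() const {
        if (total == 0) {
            return "0";
        }
        return std::to_string(sub / total);
    }
};

// total = 4, sub = 10: the exact mean is 2.5.
std::string mean_as_double() {
    return ratio_sketch<double>().add(4, 10).to_json();
}

std::string mean_as_integer() {
    return ratio_sketch<std::int64_t>().add(4, 10).to_json();
}
```

With these inputs the `double` variant formats the full quotient and the `int64_t` variant formats the truncated one, which appears to be the behaviour the `integral_ratio_holder` typedef is meant to provide.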

@@ -177,6 +177,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_key_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME
@@ -280,6 +294,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_counter_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME

@@ -40,13 +40,13 @@ static auto transformer(const std::vector<collectd_value>& values) {
for (auto v: values) {
switch (v._type) {
case scollectd::data_type::GAUGE:
collected_value.values.push(v.u._d);
collected_value.values.push(v.d());
break;
case scollectd::data_type::DERIVE:
collected_value.values.push(v.u._i);
collected_value.values.push(v.i());
break;
default:
collected_value.values.push(v.u._ui);
collected_value.values.push(v.ui());
break;
}
}

@@ -191,8 +191,8 @@ static double update_ratio(double acc, double f, double total) {
return acc;
}
static ratio_holder mean_row_size(column_family& cf) {
ratio_holder res;
static integral_ratio_holder mean_row_size(column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
auto c = i->get_stats_metadata().estimated_row_size.count();
res.sub += i->get_stats_metadata().estimated_row_size.mean() * c;
@@ -562,11 +562,13 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, req->param["name"], integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_all_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

@@ -22,16 +22,22 @@
#include "locator/snitch_base.hh"
#include "endpoint_snitch.hh"
#include "api/api-doc/endpoint_snitch_info.json.hh"
#include "utils/fb_utilities.hh"
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r) {
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(req.get_query_param("host"));
static auto host_or_broadcast = [](const_req req) {
auto host = req.get_query_param("host");
return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);
};
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_rack.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(req.get_query_param("host"));
httpd::endpoint_snitch_info_json::get_rack.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {

@@ -88,6 +88,20 @@ void set_failure_detector(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(state);
});
});
fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {
return gms::get_arrival_samples().then([](std::map<gms::inet_address, gms::arrival_window> map) {
std::vector<fd::endpoint_phi_value> res;
auto now = gms::arrival_window::clk::now();
for (auto& p : map) {
fd::endpoint_phi_value val;
val.endpoint = p.first.to_sstring();
val.phi = p.second.phi(now);
res.emplace_back(std::move(val));
}
return make_ready_future<json::json_return_type>(res);
});
});
}
}

@@ -22,6 +22,8 @@
#include "storage_service.hh"
#include "api/api-doc/storage_service.json.hh"
#include "db/config.hh"
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <service/storage_service.hh>
#include <db/commitlog/commitlog.hh>
#include <gms/gossiper.hh>
@@ -457,8 +459,15 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_keyspaces.set(r, [&ctx](const_req req) {
auto non_system = req.get_query_param("non_system");
return map_keys(ctx.db.local().keyspaces());
auto type = req.get_query_param("type");
if (type == "user") {
return ctx.db.local().get_non_system_keyspaces();
} else if (type == "non_local_strategy") {
return map_keys(ctx.db.local().get_keyspaces() | boost::adaptors::filtered([](const auto& p) {
return p.second.get_replication_strategy().get_type() != locator::replication_strategy_type::local;
}));
}
return map_keys(ctx.db.local().get_keyspaces());
});
ss::update_snitch.set(r, [](std::unique_ptr<request> req) {
@@ -542,9 +551,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::is_joined.set(r, [] (std::unique_ptr<request> req) {
return service::get_local_storage_service().is_joined().then([] (bool is_joined) {
return make_ready_future<json::json_return_type>(is_joined);
});
return make_ready_future<json::json_return_type>(service::get_local_storage_service().is_joined());
});
ss::set_stream_throughput_mb_per_sec.set(r, [](std::unique_ptr<request> req) {
@@ -664,17 +671,23 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {
auto probability = req->get_query_param("probability");
try {
return futurize<json::json_return_type>::apply([probability] {
double real_prob = std::stod(probability.c_str());
return tracing::tracing::tracing_instance().invoke_on_all([real_prob] (auto& local_tracing) {
local_tracing.set_trace_probability(real_prob);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
} catch (...) {
throw httpd::bad_param_exception(sprint("Bad format of a probability value: \"%s\"", probability.c_str()));
}
}).then_wrapped([probability] (auto&& f) {
try {
f.get();
return make_ready_future<json::json_return_type>(json_void());
} catch (std::out_of_range& e) {
throw httpd::bad_param_exception(e.what());
} catch (std::invalid_argument&){
throw httpd::bad_param_exception(sprint("Bad format in a probability value: \"%s\"", probability.c_str()));
}
});
});
ss::get_trace_probability.set(r, [](std::unique_ptr<request> req) {
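
The reworked `set_trace_probability` handler distinguishes the two exceptions that `std::stod` can throw. The sketch below shows that distinction on its own, outside the Seastar future machinery; `parse_status` and `classify_probability` are hypothetical names introduced for illustration and do not appear in the change.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical result type for the illustration.
enum class parse_status { ok, bad_format, out_of_range };

// Classifies a probability string the way the handler's catch clauses do:
// std::invalid_argument means no conversion could be performed,
// std::out_of_range means the value does not fit in a double.
parse_status classify_probability(const std::string& text) {
    try {
        (void)std::stod(text);
        return parse_status::ok;
    } catch (const std::invalid_argument&) {
        return parse_status::bad_format;
    } catch (const std::out_of_range&) {
        return parse_status::out_of_range;
    }
}
```

In the handler itself both failure cases are turned into `httpd::bad_param_exception`, so the client presumably receives a bad-parameter error rather than a generic server failure.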

@@ -28,7 +28,7 @@
#include "utils/managed_bytes.hh"
#include "net/byteorder.hh"
#include <cstdint>
#include <iostream>
#include <iosfwd>
template<typename T>
static inline
@@ -57,6 +57,7 @@ private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t REVERT_FLAG = 0x04; // transient flag used to efficiently implement ReversiblyMergeable for atomic cells.
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
@@ -67,6 +68,9 @@ private:
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
private:
static bool is_counter_update(bytes_view cell) {
return cell[0] & COUNTER_UPDATE_FLAG;
}
static bool is_revert_set(bytes_view cell) {
return cell[0] & REVERT_FLAG;
}
@@ -126,6 +130,14 @@ private:
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, bytes_view value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
@@ -149,17 +161,20 @@ protected:
atomic_cell_base(ByteContainer&& data) : _data(std::forward<ByteContainer>(data)) { }
friend class atomic_cell_or_collection;
public:
bool is_counter_update() const {
return atomic_cell_type::is_counter_update(_data);
}
bool is_revert_set() const {
return atomic_cell_type::is_revert_set(_data);
}
bool is_live() const {
return atomic_cell_type::is_live(_data);
}
bool is_live(tombstone t) const {
return is_live() && !is_covered_by(t);
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
}
bool is_live(tombstone t, gc_clock::time_point now) const {
return is_live() && !is_covered_by(t) && !has_expired(now);
bool is_live(tombstone t, gc_clock::time_point now, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return atomic_cell_type::is_live_and_has_ttl(_data);
@@ -167,8 +182,8 @@ public:
bool is_dead(gc_clock::time_point now) const {
return atomic_cell_type::is_dead(_data) || has_expired(now);
}
bool is_covered_by(tombstone t) const {
return timestamp() <= t.timestamp;
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
@@ -239,6 +254,12 @@ public:
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {
return make_live(timestamp, bytes_view(value));
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, bytes_view value) {
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, const bytes& value) {
return atomic_cell_type::make_live_counter_update(timestamp, bytes_view(value));
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
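
The new `is_covered_by` rule for counters can be read as a small truth table. The sketch below restates the predicate over plain integers; it assumes `api::timestamp_type` is a signed 64-bit integer and that `api::missing_timestamp` is its minimum value, neither of which is shown in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumptions for the sketch: the real definitions live in the api namespace
// and are not part of this diff.
using timestamp_type = std::int64_t;
constexpr timestamp_type missing_timestamp =
    std::numeric_limits<timestamp_type>::min();

// Same predicate as atomic_cell_base::is_covered_by in the diff:
// a regular cell is covered only by a tombstone that is at least as new,
// while a counter cell is covered by any tombstone that exists at all.
bool is_covered_by(timestamp_type cell_ts, timestamp_type tomb_ts, bool is_counter) {
    return cell_ts <= tomb_ts || (is_counter && tomb_ts != missing_timestamp);
}
```

Under these assumptions a counter cell written after a deletion is still treated as covered, whereas a regular cell with a newer timestamp survives the same tombstone.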

@@ -26,16 +26,17 @@
#include "types.hh"
#include "atomic_cell.hh"
#include "hashing.hh"
#include "counters.hh"
template<>
struct appending_hash<collection_mutation_view> {
template<typename Hasher>
void operator()(Hasher& h, collection_mutation_view cell) const {
void operator()(Hasher& h, collection_mutation_view cell, const column_definition& cdef) const {
auto m_view = collection_type_impl::deserialize_mutation_form(cell);
::feed_hash(h, m_view.tomb);
for (auto&& key_and_value : m_view.cells) {
::feed_hash(h, key_and_value.first);
::feed_hash(h, key_and_value.second);
::feed_hash(h, key_and_value.second, cdef);
}
}
};
@@ -43,10 +44,14 @@ struct appending_hash<collection_mutation_view> {
template<>
struct appending_hash<atomic_cell_view> {
template<typename Hasher>
void operator()(Hasher& h, atomic_cell_view cell) const {
void operator()(Hasher& h, atomic_cell_view cell, const column_definition& cdef) const {
feed_hash(h, cell.is_live());
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
::feed_hash(h, counter_cell_view(cell));
return;
}
if (cell.is_live_and_has_ttl()) {
feed_hash(h, cell.expiry());
feed_hash(h, cell.ttl());
@@ -61,15 +66,15 @@ struct appending_hash<atomic_cell_view> {
template<>
struct appending_hash<atomic_cell> {
template<typename Hasher>
void operator()(Hasher& h, const atomic_cell& cell) const {
feed_hash(h, static_cast<atomic_cell_view>(cell));
void operator()(Hasher& h, const atomic_cell& cell, const column_definition& cdef) const {
feed_hash(h, static_cast<atomic_cell_view>(cell), cdef);
}
};
template<>
struct appending_hash<collection_mutation> {
template<typename Hasher>
void operator()(Hasher& h, const collection_mutation& cm) const {
feed_hash(h, static_cast<collection_mutation_view>(cm));
void operator()(Hasher& h, const collection_mutation& cm, const column_definition& cdef) const {
feed_hash(h, static_cast<collection_mutation_view>(cm), cdef);
}
};

@@ -58,13 +58,13 @@ public:
template<typename Hasher>
void feed_hash(Hasher& h, const column_definition& def) const {
if (def.is_atomic()) {
::feed_hash(h, as_atomic_cell());
::feed_hash(h, as_atomic_cell(), def);
} else {
::feed_hash(as_collection_mutation(), h, def.type);
::feed_hash(h, as_collection_mutation(), def);
}
}
size_t memory_usage() const {
return _data.memory_usage();
size_t external_memory_usage() const {
return _data.external_memory_usage();
}
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};

@@ -73,12 +73,14 @@ class auth_migration_listener : public service::migration_listener {
void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_create_function(const sstring& ks_name, const sstring& function_name) override {}
void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_create_view(const sstring& ks_name, const sstring& view_name) override {}
void on_update_keyspace(const sstring& ks_name) override {}
void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}
void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
auth::authorizer::get().revoke_all(auth::data_resource(ks_name));
@@ -89,6 +91,7 @@ class auth_migration_listener : public service::migration_listener {
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
static auth_migration_listener auth_migration;
@@ -243,7 +246,8 @@ future<> auth::auth::setup() {
std::map<sstring, sstring> opts;
opts["replication_factor"] = "1";
auto ksm = keyspace_metadata::new_keyspace(AUTH_KS, "org.apache.cassandra.locator.SimpleStrategy", opts, true);
f = service::get_local_migration_manager().announce_new_keyspace(ksm, false);
// We use min_timestamp so that default keyspace metadata will lose to any manual adjustments. See issue #2129.
f = service::get_local_migration_manager().announce_new_keyspace(ksm, api::min_timestamp, false);
}
return f.then([] {
@@ -353,7 +357,7 @@ future<> auth::auth::setup_table(const sstring& name, const sstring& cql) {
parsed->prepare_keyspace(AUTH_KS);
::shared_ptr<cql3::statements::create_table_statement> statement =
static_pointer_cast<cql3::statements::create_table_statement>(
parsed->prepare(db)->statement);
parsed->prepare(db, qp.get_cql_stats())->statement);
auto schema = statement->get_cf_meta_data();
auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());

@@ -38,7 +38,7 @@ class bytes_ostream {
public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
static constexpr size_type max_chunk_size = 16 * 1024;
static constexpr size_type max_chunk_size() { return 16 * 1024; }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -59,7 +59,6 @@ private:
};
// FIXME: consider increasing chunk size as the buffer grows
static constexpr size_type chunk_size{512};
static constexpr size_type usable_chunk_size{chunk_size - sizeof(chunk)};
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
@@ -100,6 +99,19 @@ private:
}
return _current->size - _current->offset;
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be at least chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
return std::max<size_type>(next_size, data_size + sizeof(chunk));
}
// Makes room for a contiguous region of given size.
// The region is accounted for as already written.
// size must not be zero.
@@ -110,7 +122,7 @@ private:
_size += size;
return ret;
} else {
auto alloc_size = size <= usable_chunk_size ? chunk_size : (size + sizeof(chunk));
auto alloc_size = next_alloc_size(size);
auto space = malloc(alloc_size);
if (!space) {
throw std::bad_alloc();
@@ -205,7 +217,7 @@ public:
}
while (!v.empty()) {
auto this_size = std::min(v.size(), size_t(max_chunk_size));
auto this_size = std::min(v.size(), size_t(max_chunk_size()));
std::copy_n(v.begin(), this_size, alloc(this_size));
v.remove_prefix(this_size);
}
@@ -329,7 +341,7 @@ public:
// if its size is below max_chunk_size. We probably could also gain
// some read performance by doing "real" reduction, i.e. merging
// all chunks until all but the last one is max_chunk_size.
if (size() < max_chunk_size) {
if (size() < max_chunk_size()) {
linearize();
}
}
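The growth policy described in the `next_alloc_size()` comment above can be sketched on its own. This is a minimal illustration, not the code from the commit: `initial_chunk_size`, `max_chunk` and `header_size` are hypothetical stand-ins for `chunk_size`, `max_chunk_size()` and `sizeof(chunk)`, and the current chunk size is passed in as a plain number (0 meaning "no chunk yet").

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the constants used in the diff.
constexpr std::size_t initial_chunk_size = 512;   // models chunk_size
constexpr std::size_t max_chunk = 16 * 1024;      // models max_chunk_size()
constexpr std::size_t header_size = 16;           // models sizeof(chunk), assumed value

// Returns the size of the next allocation, given the size of the current
// chunk (0 if there is none yet) and the payload that has to fit.
inline std::size_t next_alloc_size(std::size_t current_chunk_size, std::size_t data_size) {
    assert(data_size > 0);
    // Double the previous chunk, or start from the initial size.
    std::size_t next = current_chunk_size ? current_chunk_size * 2 : initial_chunk_size;
    // Doubling stops at the cap...
    next = std::min(next, max_chunk);
    // ...but a single request larger than the cap is still honoured.
    return std::max(next, data_size + header_size);
}
```

Under these assumed constants, successive small writes would see allocations of 512, 1024, 2048 and so on up to 16384, while one oversized write gets exactly the payload plus the header.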


@@ -22,6 +22,7 @@
#include "canonical_mutation.hh"
#include "mutation.hh"
#include "mutation_partition_serializer.hh"
#include "counters.hh"
#include "converting_mutation_partition_applier.hh"
#include "hashing_partition_visitor.hh"
#include "utils/UUID.hh"
@@ -44,7 +45,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
mutation_partition_serializer part_ser(*m.schema(), m.partition());
bytes_ostream out;
ser::writer_of_canonical_mutation wr(out);
ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())

cell_locking.hh (new file, 545 lines)

@@ -0,0 +1,545 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <boost/intrusive/unordered_set.hpp>
#if __has_include(<boost/container/small_vector.hpp>)
#include <boost/container/small_vector.hpp>
template <typename T, size_t N>
using small_vector = boost::container::small_vector<T, N>;
#else
#include <vector>
template <typename T, size_t N>
using small_vector = std::vector<T>;
#endif
#include "fnv1a_hasher.hh"
#include "streamed_mutation.hh"
#include "mutation_partition.hh"
class cells_range {
using ids_vector_type = small_vector<column_id, 5>;
position_in_partition_view _position;
ids_vector_type _ids;
public:
using iterator = ids_vector_type::iterator;
using const_iterator = ids_vector_type::const_iterator;
cells_range()
: _position(position_in_partition_view(position_in_partition_view::static_row_tag_t())) { }
explicit cells_range(position_in_partition_view pos, const row& cells)
: _position(pos)
{
_ids.reserve(cells.size());
cells.for_each_cell([this] (auto id, auto&&) {
_ids.emplace_back(id);
});
}
position_in_partition_view position() const { return _position; }
bool empty() const { return _ids.empty(); }
auto begin() const { return _ids.begin(); }
auto end() const { return _ids.end(); }
};
class partition_cells_range {
const mutation_partition& _mp;
public:
class iterator {
const mutation_partition& _mp;
stdx::optional<mutation_partition::rows_type::const_iterator> _position;
cells_range _current;
public:
explicit iterator(const mutation_partition& mp)
: _mp(mp)
, _current(position_in_partition_view(position_in_partition_view::static_row_tag_t()), mp.static_row())
{ }
iterator(const mutation_partition& mp, mutation_partition::rows_type::const_iterator it)
: _mp(mp)
, _position(it)
{ }
iterator& operator++() {
if (!_position) {
_position = _mp.clustered_rows().begin();
} else {
++(*_position);
}
if (_position != _mp.clustered_rows().end()) {
auto it = *_position;
_current = cells_range(position_in_partition_view(position_in_partition_view::clustering_row_tag_t(), it->key()),
it->row().cells());
}
return *this;
}
iterator operator++(int) {
iterator it(*this);
operator++();
return it;
}
cells_range& operator*() {
return _current;
}
cells_range* operator->() {
return &_current;
}
bool operator==(const iterator& other) const {
return _position == other._position;
}
bool operator!=(const iterator& other) const {
return !(*this == other);
}
};
public:
explicit partition_cells_range(const mutation_partition& mp) : _mp(mp) { }
iterator begin() const {
return iterator(_mp);
}
iterator end() const {
return iterator(_mp, _mp.clustered_rows().end());
}
};
class locked_cell;
class cell_locker {
class partition_entry;
struct cell_address {
position_in_partition position;
column_id id;
};
class cell_entry : public bi::unordered_set_base_hook<bi::link_mode<bi::auto_unlink>>,
public enable_lw_shared_from_this<cell_entry> {
partition_entry& _parent;
cell_address _address;
semaphore _semaphore { 0 };
friend class cell_locker;
public:
cell_entry(partition_entry& parent, position_in_partition position, column_id id)
: _parent(parent)
, _address { std::move(position), id }
{ }
// Upgrades cell_entry to another schema.
// Changes the value of cell_address, so cell_entry has to be
// temporarily removed from its parent partition_entry.
// Returns true if the cell_entry still exists in the new schema and
// should be reinserted.
bool upgrade(const schema& from, const schema& to, column_kind kind) noexcept {
auto& old_column_mapping = from.get_column_mapping();
auto& column = old_column_mapping.column_at(kind, _address.id);
auto cdef = to.get_column_definition(column.name());
if (!cdef) {
return false;
}
_address.id = cdef->id;
return true;
}
const position_in_partition& position() const {
return _address.position;
}
future<> lock() {
return _semaphore.wait();
}
void unlock() {
_semaphore.signal();
}
~cell_entry() {
if (!is_linked()) {
return;
}
unlink();
if (!--_parent._cell_count) {
delete &_parent;
}
}
class hasher {
const schema* _schema; // pointer instead of reference for default assignment
public:
explicit hasher(const schema& s) : _schema(&s) { }
size_t operator()(const cell_address& ca) const {
fnv1a_hasher hasher;
ca.position.feed_hash(hasher, *_schema);
::feed_hash(hasher, ca.id);
return hasher.finalize();
}
size_t operator()(const cell_entry& ce) const {
return operator()(ce._address);
}
};
class equal_compare {
position_in_partition::equal_compare _cmp;
private:
bool do_compare(const cell_address& a, const cell_address& b) const {
return a.id == b.id && _cmp(a.position, b.position);
}
public:
explicit equal_compare(const schema& s) : _cmp(s) { }
bool operator()(const cell_address& ca, const cell_entry& ce) const {
return do_compare(ca, ce._address);
}
bool operator()(const cell_entry& ce, const cell_address& ca) const {
return do_compare(ca, ce._address);
}
bool operator()(const cell_entry& a, const cell_entry& b) const {
return do_compare(a._address, b._address);
}
};
};
class partition_entry : public bi::unordered_set_base_hook<bi::link_mode<bi::auto_unlink>> {
using cells_type = bi::unordered_set<cell_entry,
bi::equal<cell_entry::equal_compare>,
bi::hash<cell_entry::hasher>,
bi::constant_time_size<false>>;
static constexpr size_t initial_bucket_count = 64;
using max_load_factor = std::ratio<3, 4>;
dht::decorated_key _key;
cell_locker& _parent;
size_t _rehash_at_size = compute_rehash_at_size(initial_bucket_count);
std::unique_ptr<cells_type::bucket_type[]> _buckets; // TODO: start with internal storage?
size_t _cell_count = 0; // cells_type::empty() is not O(1) if the hook is auto-unlink
cells_type _cells;
schema_ptr _schema;
friend class cell_entry;
private:
static constexpr size_t compute_rehash_at_size(size_t bucket_count) {
return bucket_count * max_load_factor::num / max_load_factor::den;
}
void maybe_rehash() {
if (_cell_count >= _rehash_at_size) {
auto new_bucket_count = std::min(_cells.bucket_count() * 2, _cells.bucket_count() + 1024);
auto buckets = std::make_unique<cells_type::bucket_type[]>(new_bucket_count);
_cells.rehash(cells_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
_rehash_at_size = compute_rehash_at_size(new_bucket_count);
}
}
public:
partition_entry(schema_ptr s, cell_locker& parent, const dht::decorated_key& dk)
: _key(dk)
, _parent(parent)
, _buckets(std::make_unique<cells_type::bucket_type[]>(initial_bucket_count))
, _cells(cells_type::bucket_traits(_buckets.get(), initial_bucket_count),
cell_entry::hasher(*s), cell_entry::equal_compare(*s))
, _schema(s)
{ }
~partition_entry() {
if (is_linked()) {
_parent._partition_count--;
}
}
// Upgrades partition entry to new schema. Returns false if all
// cell_entries have been removed during the upgrade.
bool upgrade(schema_ptr new_schema);
void insert(lw_shared_ptr<cell_entry> cell) {
_cells.insert(*cell);
_cell_count++;
maybe_rehash();
}
cells_type& cells() {
return _cells;
}
struct hasher {
size_t operator()(const dht::decorated_key& dk) const {
return std::hash<dht::decorated_key>()(dk);
}
size_t operator()(const partition_entry& pe) const {
return operator()(pe._key);
}
};
class equal_compare {
dht::decorated_key_equals_comparator _cmp;
public:
explicit equal_compare(const schema& s) : _cmp(s) { }
bool operator()(const dht::decorated_key& dk, const partition_entry& pe) {
return _cmp(dk, pe._key);
}
bool operator()(const partition_entry& pe, const dht::decorated_key& dk) {
return _cmp(dk, pe._key);
}
bool operator()(const partition_entry& a, const partition_entry& b) {
return _cmp(a._key, b._key);
}
};
};
using partitions_type = bi::unordered_set<partition_entry,
bi::equal<partition_entry::equal_compare>,
bi::hash<partition_entry::hasher>,
bi::constant_time_size<false>>;
static constexpr size_t initial_bucket_count = 4 * 1024;
using max_load_factor = std::ratio<3, 4>;
std::unique_ptr<partitions_type::bucket_type[]> _buckets;
partitions_type _partitions;
size_t _partition_count = 0;
size_t _rehash_at_size = compute_rehash_at_size(initial_bucket_count);
schema_ptr _schema;
// partitions_type uses equality comparator which keeps a reference to the
// original schema, we must ensure that it doesn't die.
schema_ptr _original_schema;
friend class locked_cell;
private:
struct locker;
static constexpr size_t compute_rehash_at_size(size_t bucket_count) {
return bucket_count * max_load_factor::num / max_load_factor::den;
}
void maybe_rehash() {
if (_partition_count >= _rehash_at_size) {
auto new_bucket_count = std::min(_partitions.bucket_count() * 2, _partitions.bucket_count() + 64 * 1024);
auto buckets = std::make_unique<partitions_type::bucket_type[]>(new_bucket_count);
_partitions.rehash(partitions_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
_rehash_at_size = compute_rehash_at_size(new_bucket_count);
}
}
public:
explicit cell_locker(schema_ptr s)
: _buckets(std::make_unique<partitions_type::bucket_type[]>(initial_bucket_count))
, _partitions(partitions_type::bucket_traits(_buckets.get(), initial_bucket_count),
partition_entry::hasher(), partition_entry::equal_compare(*s))
, _schema(s)
, _original_schema(std::move(s))
{ }
~cell_locker() {
assert(_partitions.empty());
}
void set_schema(schema_ptr s) {
_schema = s;
}
schema_ptr schema() const {
return _schema;
}
// partition_cells_range is required to be in cell_locker::schema()
future<std::vector<locked_cell>> lock_cells(const dht::decorated_key& dk, partition_cells_range&& range);
};
class locked_cell {
lw_shared_ptr<cell_locker::cell_entry> _entry;
public:
explicit locked_cell(lw_shared_ptr<cell_locker::cell_entry> entry)
: _entry(std::move(entry)) { }
locked_cell(const locked_cell&) = delete;
locked_cell(locked_cell&&) = default;
~locked_cell() {
if (_entry) {
_entry->unlock();
}
}
};
struct cell_locker::locker {
cell_entry::hasher _hasher;
cell_entry::equal_compare _eq_cmp;
partition_entry& _partition_entry;
partition_cells_range _range;
partition_cells_range::iterator _current_ck;
cells_range::const_iterator _current_cell;
std::vector<locked_cell> _locks;
private:
void update_ck() {
if (!is_done()) {
_current_cell = _current_ck->begin();
}
}
future<> lock_next();
bool is_done() const { return _current_ck == _range.end(); }
public:
explicit locker(const ::schema& s, partition_entry& pe, partition_cells_range&& range)
: _hasher(s)
, _eq_cmp(s)
, _partition_entry(pe)
, _range(std::move(range))
, _current_ck(_range.begin())
{
update_ck();
}
locker(const locker&) = delete;
locker(locker&&) = delete;
future<> lock_all() {
// Cannot defer before first call to lock_next().
return lock_next().then([this] {
return do_until([this] { return is_done(); }, [this] {
return lock_next();
});
});
}
std::vector<locked_cell> get() && { return std::move(_locks); }
};
inline
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range) {
partition_entry::hasher pe_hash;
partition_entry::equal_compare pe_eq(*_schema);
auto it = _partitions.find(dk, pe_hash, pe_eq);
std::unique_ptr<partition_entry> partition;
if (it == _partitions.end()) {
partition = std::make_unique<partition_entry>(_schema, *this, dk);
} else if (!it->upgrade(_schema)) {
partition = std::unique_ptr<partition_entry>(&*it);
_partition_count--;
_partitions.erase(it);
}
if (partition) {
std::vector<locked_cell> locks;
for (auto&& r : range) {
if (r.empty()) {
continue;
}
for (auto&& c : r) {
auto cell = make_lw_shared<cell_entry>(*partition, position_in_partition(r.position()), c);
partition->insert(cell);
locks.emplace_back(std::move(cell));
}
}
if (!locks.empty()) {
_partitions.insert(*partition.release());
_partition_count++;
maybe_rehash();
}
return make_ready_future<std::vector<locked_cell>>(std::move(locks));
}
auto l = std::make_unique<locker>(*_schema, *it, std::move(range));
auto f = l->lock_all();
return f.then([l = std::move(l)] {
return std::move(*l).get();
});
}
inline
future<> cell_locker::locker::lock_next() {
while (!is_done()) {
if (_current_cell == _current_ck->end()) {
++_current_ck;
update_ck();
continue;
}
auto cid = *_current_cell++;
cell_address ca { position_in_partition(_current_ck->position()), cid };
auto it = _partition_entry.cells().find(ca, _hasher, _eq_cmp);
if (it != _partition_entry.cells().end()) {
return it->lock().then([this, ce = it->shared_from_this()] () mutable {
_locks.emplace_back(std::move(ce));
});
}
auto cell = make_lw_shared<cell_entry>(_partition_entry, position_in_partition(_current_ck->position()), cid);
_partition_entry.insert(cell);
_locks.emplace_back(std::move(cell));
}
return make_ready_future<>();
}
inline
bool cell_locker::partition_entry::upgrade(schema_ptr new_schema) {
if (_schema == new_schema) {
return true;
}
auto buckets = std::make_unique<cells_type::bucket_type[]>(_cells.bucket_count());
auto cells = cells_type(cells_type::bucket_traits(buckets.get(), _cells.bucket_count()),
cell_entry::hasher(*new_schema), cell_entry::equal_compare(*new_schema));
_cells.clear_and_dispose([&] (cell_entry* cell_ptr) noexcept {
auto& cell = *cell_ptr;
auto kind = cell.position().is_static_row() ? column_kind::static_column
: column_kind::regular_column;
auto reinsert = cell.upgrade(*_schema, *new_schema, kind);
if (reinsert) {
cells.insert(cell);
} else {
_cell_count--;
}
});
// bi::unordered_set move assignment is actually a swap.
// Original _buckets cannot be destroyed before the container using them is
// so we need to explicitly make sure that the original _cells is no more.
_cells = std::move(cells);
auto destroy = [] (auto) { };
destroy(std::move(cells));
_buckets = std::move(buckets);
_schema = new_schema;
return _cell_count;
}
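Both `partition_entry::maybe_rehash()` and `cell_locker::maybe_rehash()` above follow the same arithmetic: rehash once the element count reaches 3/4 of the bucket count, and grow by doubling until the step would exceed a fixed bound (1024 buckets for cells, 64 * 1024 for partitions). A small sketch of just that arithmetic, where `next_bucket_count` and its `max_step` parameter are hypothetical names introduced for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <ratio>

using max_load_factor = std::ratio<3, 4>;

// Element count at which a table with bucket_count buckets is rehashed.
constexpr std::size_t compute_rehash_at_size(std::size_t bucket_count) {
    return bucket_count * static_cast<std::size_t>(max_load_factor::num)
                        / static_cast<std::size_t>(max_load_factor::den);
}

// New bucket count: double while the table is small, then grow linearly
// by max_step so that a large table does not double its memory at once.
constexpr std::size_t next_bucket_count(std::size_t bucket_count, std::size_t max_step) {
    return std::min(bucket_count * 2, bucket_count + max_step);
}
```

With the per-partition settings from the diff (64 initial buckets, step bound 1024), the first rehash would happen at 48 cells and move to 128 buckets. Once the table reaches 1024 buckets, each further rehash adds 1024 more.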


@@ -27,125 +27,131 @@
class checked_file_impl : public file_impl {
public:
checked_file_impl(disk_error_signal_type& s, file f)
: _signal(s) , _file(f) {
checked_file_impl(const io_error_handler& error_handler, file f)
: _error_handler(error_handler), _file(f) {
_memory_dma_alignment = f.memory_dma_alignment();
_disk_read_dma_alignment = f.disk_read_dma_alignment();
_disk_write_dma_alignment = f.disk_write_dma_alignment();
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, iov, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, iov, pc);
});
}
virtual future<> flush(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->flush();
});
}
virtual future<struct stat> stat(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->stat();
});
}
virtual future<> truncate(uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->truncate(length);
});
}
virtual future<> discard(uint64_t offset, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->discard(offset, length);
});
}
virtual future<> allocate(uint64_t position, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->allocate(position, length);
});
}
virtual future<uint64_t> size(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->size();
});
}
virtual future<> close() override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->close();
});
}
// returns a handle for plain file, so make_checked_file() should be called
// on file returned by handle.
virtual std::unique_ptr<seastar::file_handle_impl> dup() override {
return get_file_impl(_file)->dup();
}
virtual subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->list_directory(next);
});
}
private:
disk_error_signal_type &_signal;
const io_error_handler& _error_handler;
file _file;
};
inline file make_checked_file(disk_error_signal_type& signal, file& f)
inline file make_checked_file(const io_error_handler& error_handler, file f)
{
return file(::make_shared<checked_file_impl>(signal, f));
return file(::make_shared<checked_file_impl>(error_handler, f));
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags,
file_open_options options)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags, options).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_directory(disk_error_signal_type& signal,
inline open_checked_directory(const io_error_handler& error_handler,
sstring name)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return engine().open_directory(name).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
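Every method of `checked_file_impl` above routes its call through `do_io_check()` together with an error handler. The definition of `do_io_check()` is not part of this diff, so the following is only a synchronous sketch of the general pattern, under the assumption that the handler is notified of a failure and the exception then propagates. The names `io_error_handler_sketch`, `do_io_check_sketch` and `handler_calls_for` are hypothetical, and the real code works with Seastar futures rather than plain return values.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical handler type: receives the exception that escaped the operation.
using io_error_handler_sketch = std::function<void(std::exception_ptr)>;

// Runs func; on failure, notifies the handler and rethrows the same exception.
template <typename Func>
auto do_io_check_sketch(const io_error_handler_sketch& handler, Func&& func) {
    try {
        return std::forward<Func>(func)();
    } catch (...) {
        handler(std::current_exception());
        throw;
    }
}

// Demo helper: runs one operation through the wrapper and reports how many
// times the handler fired.
inline int handler_calls_for(bool operation_fails) {
    int calls = 0;
    io_error_handler_sketch handler = [&calls](std::exception_ptr) { ++calls; };
    try {
        do_io_check_sketch(handler, [&]() -> int {
            if (operation_fails) {
                throw std::runtime_error("simulated I/O error");
            }
            return 0;
        });
    } catch (const std::runtime_error&) {
        // The wrapper rethrows after notifying the handler.
    }
    return calls;
}
```

Passing the handler explicitly, as the diff does with `const io_error_handler&`, lets each caller decide how failures are reported.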


@@ -62,27 +62,54 @@ public:
: prefix(prefix)
, kind(kind)
{ }
struct compare {
bound_view(const bound_view& other) noexcept = default;
bound_view& operator=(const bound_view& other) noexcept {
if (this != &other) {
this->~bound_view();
new (this) bound_view(other);
}
return *this;
}
struct tri_compare {
// To make it assignable and to avoid taking a schema_ptr, we
// wrap the schema reference.
std::reference_wrapper<const schema> _s;
compare(const schema& s) : _s(s)
tri_compare(const schema& s) : _s(s)
{ }
bool operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
int operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
auto type = _s.get().clustering_key_prefix_type();
auto res = prefix_equality_tri_compare(type->types().begin(),
type->begin(p1), type->end(p1),
type->begin(p2), type->end(p2),
tri_compare);
::tri_compare);
if (res) {
return res < 0;
return res;
}
auto d1 = p1.size(_s);
auto d2 = p2.size(_s);
if (d1 == d2) {
return w1 < w2;
return w1 - w2;
}
return d1 < d2 ? w1 <= 0 : w2 > 0;
return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));
}
int operator()(const bound_view b, const clustering_key_prefix& p) const {
return operator()(b.prefix, weight(b.kind), p, 0);
}
int operator()(const clustering_key_prefix& p, const bound_view b) const {
return operator()(p, 0, b.prefix, weight(b.kind));
}
int operator()(const bound_view b1, const bound_view b2) const {
return operator()(b1.prefix, weight(b1.kind), b2.prefix, weight(b2.kind));
}
};
struct compare {
// To make it assignable and to avoid taking a schema_ptr, we
// wrap the schema reference.
tri_compare _cmp;
compare(const schema& s) : _cmp(s)
{ }
bool operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
return _cmp(p1, w1, p2, w2) < 0;
}
bool operator()(const bound_view b, const clustering_key_prefix& p) const {
return operator()(b.prefix, weight(b.kind), p, 0);
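The hunk above turns the boolean comparator into a three-way `tri_compare` and rebuilds `compare` on top of it with `< 0`. The weight arithmetic can be checked in isolation with a simplified model. In this sketch `prefix_sketch` (a vector of ints) is a hypothetical stand-in for `clustering_key_prefix`, and element comparison replaces `prefix_equality_tri_compare`; only the tail of the function mirrors the expressions in the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for clustering_key_prefix: a plain vector of ints.
using prefix_sketch = std::vector<int>;

// Three-way comparison of (prefix, weight) pairs: negative, zero or positive.
inline int prefix_tri_compare(const prefix_sketch& p1, int w1,
                              const prefix_sketch& p2, int w2) {
    const std::size_t common = std::min(p1.size(), p2.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (p1[i] != p2[i]) {
            return p1[i] < p2[i] ? -1 : 1;
        }
    }
    if (p1.size() == p2.size()) {
        return w1 - w2;
    }
    // When one prefix is shorter, only the weight of the shorter side
    // decides the order, and the result is never 0.
    return p1.size() < p2.size() ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));
}

// The boolean comparator is derived from the three-way one.
inline bool prefix_less(const prefix_sketch& p1, int w1,
                        const prefix_sketch& p2, int w2) {
    return prefix_tri_compare(p1, w1, p2, w2) < 0;
}
```

For the shorter-prefix case, `w1 - (w1 <= 0)` is negative exactly when the old code returned `w1 <= 0`, and `-(w2 - (w2 <= 0))` is negative exactly when the old code returned `w2 > 0`, so the derived `compare` keeps the old ordering.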


@@ -39,6 +39,9 @@ public:
compatible_ring_position(const schema& s, dht::ring_position&& rp)
: _schema(&s), _rp(std::move(rp)) {
}
const dht::token& token() const {
return _rp->token();
}
friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return x._rp->tri_compare(*x._schema, *y._rp);
}


@@ -22,7 +22,7 @@
#pragma once
#include "types.hh"
#include <iostream>
#include <iosfwd>
#include <algorithm>
#include <vector>
#include <boost/range/iterator_range.hpp>


@@ -39,17 +39,17 @@ public:
static constexpr auto CHUNK_LENGTH_KB = "chunk_length_kb";
static constexpr auto CRC_CHECK_CHANCE = "crc_check_chance";
private:
compressor _compressor = compressor::none;
compressor _compressor;
std::experimental::optional<int> _chunk_length;
std::experimental::optional<double> _crc_check_chance;
public:
compression_parameters() = default;
compression_parameters(compressor c) : _compressor(c) { }
compression_parameters(compressor c = compressor::lz4) : _compressor(c) { }
compression_parameters(const std::map<sstring, sstring>& options) {
validate_options(options);
auto it = options.find(SSTABLE_COMPRESSION);
if (it == options.end() || it->second.empty()) {
_compressor = compressor::none;
return;
}
const auto& compressor_class = it->second;


@@ -217,6 +217,12 @@ batch_size_warn_threshold_in_kb: 5
# that do not have vnodes enabled.
# initial_token:
# RPC address to broadcast to drivers and other Scylla nodes. This cannot
# be set to 0.0.0.0. If left blank, this will be set to the value of
# rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must
# be set.
# broadcast_rpc_address: 1.2.3.4
###################################################
## Not currently supported, reserved for future use
###################################################
@@ -409,29 +415,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# the smaller of 1/4 of heap or 512MB.
# file_cache_size_in_mb: 512
# Total permitted memory to use for memtables. Scylla will stop
# accepting writes when the limit is exceeded until a flush completes,
# and will trigger a flush based on memtable_cleanup_threshold
# If omitted, Scylla will set both to 1/4 the size of the heap.
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
# Ratio of occupied non-flushing memtable size to total permitted size
# that will trigger a flush of the largest memtable. Larger mct will
# mean larger flushes and hence less compaction, but also less concurrent
# flush activity which can make it difficult to keep your disks fed
# under heavy write load.
#
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.11
# Specify the way Scylla allocates and manages memtable memory.
# Options are:
# heap_buffers: on heap nio buffers
# offheap_buffers: off heap (direct) nio buffers
# offheap_objects: native memory, eliminating nio buffer heap overhead
# memtable_allocation_type: heap_buffers
# Total space to use for commitlogs.
#
# If space gets above this value (it will round up to the next nearest
@@ -443,17 +426,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# available for Scylla.
commitlog_total_space_in_mb: -1
# This sets the amount of memtable flush writer threads. These will
# be blocked by disk io, and each one will hold a memtable in memory
# while blocked.
#
# memtable_flush_writers defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
#memtable_flush_writers: 8
# A fixed memory pool size in MB for SSTable index summaries. If left
# empty, this will default to 5% of the heap size. If the memory usage of
# all index summaries exceeds this limit, SSTables with low read rates will
@@ -518,13 +490,6 @@ commitlog_total_space_in_mb: -1
# Whether to start the thrift rpc server.
# start_rpc: true
# RPC address to broadcast to drivers and other Scylla nodes. This cannot
# be set to 0.0.0.0. If left blank, this will be set to the value of
# rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must
# be set.
# broadcast_rpc_address: 1.2.3.4
# enable or disable keepalive on rpc/native connections
# rpc_keepalive: true
@@ -823,3 +788,23 @@ commitlog_total_space_in_mb: -1
# By default, Scylla binds all interfaces to the prometheus API
# It is possible to restrict the listening address to a specific one
# prometheus_address: 0.0.0.0
# Distribution of data among cores (shards) within a node
#
# Scylla distributes data within a node among shards, using a round-robin
# strategy:
# [shard0] [shard1] ... [shardN-1] [shard0] [shard1] ... [shardN-1] ...
#
# Scylla versions 1.6 and below used just one repetition of the pattern;
# this interfered with data placement among nodes (vnodes).
#
# Scylla versions 1.7 and above use 4096 repetitions of the pattern; this
# provides for better data distribution.
#
# The value below is the log (base 2) of the number of repetitions.
#
# Set to 0 to avoid rewriting all data when upgrading from Scylla 1.6 and
# below.
#
# Keep at 12 for new clusters.
murmur3_partitioner_ignore_msb_bits: 12
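The round-robin pattern described in the comment above can be illustrated with a small function. This is a sketch under assumptions, not Scylla's implementation: it treats the token as an unsigned 64-bit position, drops the configured number of most significant bits by shifting left, and scales the top 32 bits of the result onto the shard range. The name `shard_of_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Maps a token position to a shard index in [0, shard_count).
// With ignore_msb_bits == 0 the token range is cut into shard_count
// contiguous pieces once; with ignore_msb_bits == 12 the same pattern
// repeats 2^12 = 4096 times across the token range.
inline unsigned shard_of_sketch(std::uint64_t token, unsigned shard_count,
                                unsigned ignore_msb_bits) {
    assert(shard_count > 0 && ignore_msb_bits < 64);
    // Dropping the top bits makes the pattern repeat.
    token <<= ignore_msb_bits;
    // Scale the top 32 bits of the remaining value onto the shard range.
    const std::uint64_t high = token >> 32;
    return static_cast<unsigned>((high * shard_count) >> 32);
}
```

With 4 shards and 12 ignored bits, the token `1 << 50` lands on shard 1, and the token `1 << 52` starts the second repetition of the pattern on shard 0.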


@@ -108,6 +108,11 @@ def debug_flag(compiler):
print('Note: debug information disabled; upgrade your compiler')
return ''
def maybe_static(flag, libs):
if flag and not args.static:
libs = '-Wl,-Bstatic {} -Wl,-Bdynamic'.format(libs)
return libs
class Thrift(object):
def __init__(self, source, service):
self.source = source
@@ -184,7 +189,6 @@ scylla_tests = [
'tests/storage_proxy_test',
'tests/schema_change_test',
'tests/mutation_reader_test',
'tests/key_reader_test',
'tests/mutation_query_test',
'tests/row_cache_test',
'tests/test-serialization',
@@ -223,6 +227,10 @@ scylla_tests = [
'tests/nonwrapping_range_test',
'tests/input_stream_test',
'tests/sstable_atomic_deletion_test',
'tests/virtual_reader_test',
'tests/view_schema_test',
'tests/counter_test',
'tests/cell_locker_test',
]
apps = [
@@ -264,7 +272,9 @@ arg_parser.add_argument('--debuginfo', action = 'store', dest = 'debuginfo', typ
arg_parser.add_argument('--static-stdc++', dest = 'staticcxx', action = 'store_true',
help = 'Link libgcc and libstdc++ statically')
arg_parser.add_argument('--static-thrift', dest = 'staticthrift', action = 'store_true',
help = 'Link libthrift statically')
help = 'Link libthrift statically')
arg_parser.add_argument('--static-boost', dest = 'staticboost', action = 'store_true',
help = 'Link boost statically')
arg_parser.add_argument('--tests-debuginfo', action = 'store', dest = 'tests_debuginfo', type = int, default = 0,
help = 'Enable(1)/disable(0)compiler debug information generation for tests')
arg_parser.add_argument('--python', action = 'store', dest = 'python', default = 'python3',
@@ -293,6 +303,7 @@ scylla_core = (['database.cc',
'memtable.cc',
'schema_mutations.cc',
'release.cc',
'supervisor.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'mutation_partition.cc',
@@ -300,8 +311,8 @@ scylla_core = (['database.cc',
'mutation_partition_serializer.cc',
'mutation_reader.cc',
'mutation_query.cc',
'key_reader.cc',
'keys.cc',
'counters.cc',
'sstables/sstables.cc',
'sstables/compress.cc',
'sstables/row.cc',
@@ -330,10 +341,12 @@ scylla_core = (['database.cc',
'cql3/statements/authentication_statement.cc',
'cql3/statements/create_keyspace_statement.cc',
'cql3/statements/create_table_statement.cc',
'cql3/statements/create_view_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/create_user_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
'cql3/statements/drop_table_statement.cc',
'cql3/statements/drop_view_statement.cc',
'cql3/statements/drop_type_statement.cc',
'cql3/statements/schema_altering_statement.cc',
'cql3/statements/ks_prop_defs.cc',
@@ -350,6 +363,7 @@ scylla_core = (['database.cc',
'cql3/statements/create_index_statement.cc',
'cql3/statements/truncate_statement.cc',
'cql3/statements/alter_table_statement.cc',
'cql3/statements/alter_view_statement.cc',
'cql3/statements/alter_user_statement.cc',
'cql3/statements/drop_user_statement.cc',
'cql3/statements/list_users_statement.cc',
@@ -395,6 +409,7 @@ scylla_core = (['database.cc',
'cql3/selection/selector.cc',
'cql3/restrictions/statement_restrictions.cc',
'cql3/result_set.cc',
'cql3/variable_specifications.cc',
'db/consistency_level.cc',
'db/system_keyspace.cc',
'db/schema_tables.cc',
@@ -487,7 +502,7 @@ scylla_core = (['database.cc',
'tracing/trace_state.cc',
'range_tombstone.cc',
'range_tombstone_list.cc',
'db/size_estimates_recorder.cc'
'disk-error-handler.cc'
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -548,6 +563,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/idl_test.idl.hh',
'idl/commitlog.idl.hh',
'idl/tracing.idl.hh',
'idl/consistency_level.idl.hh',
]
scylla_tests_dependencies = scylla_core + api + idls + [
@@ -566,49 +582,55 @@ deps = {
'scylla': idls + ['main.cc'] + scylla_core + api,
}
tests_not_using_seastar_test_framework = set([
'tests/keys_test',
pure_boost_tests = set([
'tests/partitioner_test',
'tests/map_difference_test',
'tests/keys_test',
'tests/compound_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/nonwrapping_range_test',
'tests/test-serialization',
'tests/range_test',
'tests/crc_test',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/cartesian_product_test',
])
tests_not_using_seastar_test_framework = set([
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
'tests/lsa_sync_eviction_test',
'tests/row_cache_alloc_stress',
'tests/perf_row_cache_update',
'tests/cartesian_product_test',
'tests/perf/perf_hash',
'tests/perf/perf_cql_parser',
'tests/message',
'tests/perf/perf_simple_query',
'tests/memory_footprint',
'tests/test-serialization',
'tests/gossip',
'tests/compound_test',
'tests/range_test',
'tests/crc_test',
'tests/perf/perf_sstable',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/nonwrapping_range_test',
])
]) | pure_boost_tests
for t in tests_not_using_seastar_test_framework:
if not t in scylla_tests:
raise Exception("Test %s not found in scylla_tests" % (t))
for t in scylla_tests:
deps[t] = scylla_tests_dependencies + [t + '.cc']
deps[t] = [t + '.cc']
if t not in tests_not_using_seastar_test_framework:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc']
deps['tests/bytes_ostream_test'] = ['tests/bytes_ostream_test.cc']
deps['tests/input_stream_test'] = ['tests/input_stream_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc', 'utils/uuid.cc']
deps['tests/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'tests/murmur_hash_test.cc']
deps['tests/allocation_strategy_test'] = ['tests/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/anchorless_list_test'] = ['tests/anchorless_list_test.cc']
@@ -706,6 +728,8 @@ elif args.dpdk_target:
seastar_flags += ['--dpdk-target', args.dpdk_target]
if args.staticcxx:
seastar_flags += ['--static-stdc++']
if args.staticboost:
seastar_flags += ['--static-boost']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--cflags=%s' % (seastar_cflags)]
@@ -739,7 +763,14 @@ for mode in build_modes:
seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_restat.'
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = "-lyaml-cpp -llz4 -lz -lsnappy " + pkg_config("--libs", "jsoncpp") + ' -lboost_filesystem' + ' -lcrypt' + ' -lboost_date_time'
libs = ' '.join(['-lyaml-cpp', '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt',
maybe_static(args.staticboost, '-lboost_date_time'),
])
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
for pkg in pkgs:
args.user_cflags += ' ' + pkg_config('--cflags', pkg)
libs += ' ' + pkg_config('--libs', pkg)
@@ -769,6 +800,8 @@ with open(buildfile, 'w') as f:
libs = {libs}
pool link_pool
depth = {link_pool_depth}
pool seastar_pool
depth = 1
rule ragel
command = ragel -G2 -o $out $in
description = RAGEL $out
@@ -794,7 +827,7 @@ with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
cxxflags_{mode} = -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
rule cxx.{mode}
command = $cxx -MMD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} -c -o $out $in
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} -c -o $out $in
description = CXX $out
depfile = $out.d
rule link.{mode}
@@ -853,6 +886,11 @@ with open(buildfile, 'w') as f:
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
else:
if binary.startswith('tests/'):
local_libs = '$libs'
if binary not in tests_not_using_seastar_test_framework or binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
# by many tests yields ridiculous amounts of disk space.
# So we strip the tests by default; The user can very
@@ -860,15 +898,15 @@ with open(buildfile, 'w') as f:
# to the test name, e.g., "ninja build/release/testname_g"
f.write('build $builddir/{}/{}: {}.{} {} {}\n'.format(mode, binary, tests_link_rule, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
f.write(' libs = {}\n'.format(local_libs))
else:
f.write('build $builddir/{}/{}: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
if has_thrift:
f.write(' libs = {} {} $libs\n'.format(thrift_libs, maybe_static(args.staticboost, '-lboost_system')))
for src in srcs:
if src.endswith('.cc'):
obj = '$builddir/' + mode + '/' + src.replace('.cc', '.o')
@@ -926,6 +964,7 @@ with open(buildfile, 'w') as f:
f.write('build {}: cxx.{} {} || {}\n'.format(obj, mode, cc, ' '.join(serializers)))
f.write('build seastar/build/{mode}/libseastar.a seastar/build/{mode}/apps/iotune/iotune seastar/build/{mode}/gen/http/request_parser.hh seastar/build/{mode}/gen/http/http_response_parser.hh: ninja {seastar_deps}\n'
.format(**locals()))
f.write(' pool = seastar_pool\n')
f.write(' subdir = seastar\n')
f.write(' target = build/{mode}/libseastar.a build/{mode}/apps/iotune/iotune build/{mode}/gen/http/request_parser.hh build/{mode}/gen/http/http_response_parser.hh\n'.format(**locals()))
f.write(textwrap.dedent('''\

271
counters.cc Normal file

@@ -0,0 +1,271 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "service/storage_service.hh"
#include "counters.hh"
#include "mutation.hh"
#include "combine.hh"
counter_id counter_id::local()
{
return counter_id(service::get_local_storage_service().get_local_id());
}
bool counter_id::less_compare_1_7_4::operator()(const counter_id& a, const counter_id& b) const
{
if (a._most_significant != b._most_significant) {
return a._most_significant < b._most_significant;
} else {
return a._least_significant < b._least_significant;
}
}
std::ostream& operator<<(std::ostream& os, const counter_id& id) {
return os << id.to_uuid();
}
std::ostream& operator<<(std::ostream& os, counter_shard_view csv) {
return os << "{global_shard id: " << csv.id() << " value: " << csv.value()
<< " clock: " << csv.logical_clock() << "}";
}
std::ostream& operator<<(std::ostream& os, counter_cell_view ccv) {
return os << "{counter_cell timestamp: " << ccv.timestamp() << " shards: {" << ::join(", ", ccv.shards()) << "}}";
}
void counter_cell_builder::do_sort_and_remove_duplicates()
{
boost::range::sort(_shards, [] (auto& a, auto& b) { return a.id() < b.id(); });
std::vector<counter_shard> new_shards;
new_shards.reserve(_shards.size());
for (auto& cs : _shards) {
if (new_shards.empty() || new_shards.back().id() != cs.id()) {
new_shards.emplace_back(cs);
} else {
new_shards.back().apply(cs);
}
}
_shards = std::move(new_shards);
_sorted = true;
}
std::vector<counter_shard> counter_cell_view::shards_compatible_with_1_7_4() const
{
auto sorted_shards = boost::copy_range<std::vector<counter_shard>>(shards());
counter_id::less_compare_1_7_4 cmp;
boost::range::sort(sorted_shards, [&] (auto& a, auto& b) {
return cmp(a.id(), b.id());
});
return sorted_shards;
}
bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
// TODO: optimise for single shard existing in the other
// TODO: optimise for no new shards?
auto dst_ac = dst.as_atomic_cell();
auto src_ac = src.as_atomic_cell();
if (!dst_ac.is_live() || !src_ac.is_live()) {
if (dst_ac.is_live() || (!src_ac.is_live() && compare_atomic_cell_for_merge(dst_ac, src_ac) < 0)) {
std::swap(dst, src);
return true;
}
return false;
}
if (dst_ac.is_counter_update() && src_ac.is_counter_update()) {
// FIXME: store deltas just as a normal int64_t and get rid of these calls
// to long_type
auto src_v = value_cast<int64_t>(long_type->deserialize_value(src_ac.value()));
auto dst_v = value_cast<int64_t>(long_type->deserialize_value(dst_ac.value()));
dst = atomic_cell::make_live_counter_update(std::max(dst_ac.timestamp(), src_ac.timestamp()),
long_type->decompose(src_v + dst_v));
return true;
}
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
auto a_shards = counter_cell_view(dst_ac).shards();
auto b_shards = counter_cell_view(src_ac).shards();
counter_cell_builder result;
combine(a_shards.begin(), a_shards.end(), b_shards.begin(), b_shards.end(),
result.inserter(), counter_shard_view::less_compare_by_id(), [] (auto& x, auto& y) {
return x.logical_clock() < y.logical_clock() ? y : x;
});
auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
src = std::exchange(dst, atomic_cell_or_collection(cell));
return true;
}
void counter_cell_view::revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
if (dst.as_atomic_cell().is_counter_update()) {
auto src_v = value_cast<int64_t>(long_type->deserialize_value(src.as_atomic_cell().value()));
auto dst_v = value_cast<int64_t>(long_type->deserialize_value(dst.as_atomic_cell().value()));
dst = atomic_cell::make_live(dst.as_atomic_cell().timestamp(),
long_type->decompose(dst_v - src_v));
} else {
std::swap(dst, src);
}
}
stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
{
assert(!a.is_counter_update());
assert(!b.is_counter_update());
if (!b.is_live()) {
return { };
} else if (!a.is_live()) {
return atomic_cell(a);
}
auto a_shards = counter_cell_view(a).shards();
auto b_shards = counter_cell_view(b).shards();
auto a_it = a_shards.begin();
auto a_end = a_shards.end();
auto b_it = b_shards.begin();
auto b_end = b_shards.end();
counter_cell_builder result;
while (a_it != a_end) {
while (b_it != b_end && (*b_it).id() < (*a_it).id()) {
++b_it;
}
if (b_it == b_end || (*a_it).id() != (*b_it).id() || (*a_it).logical_clock() > (*b_it).logical_clock()) {
result.add_shard(counter_shard(*a_it));
}
++a_it;
}
stdx::optional<atomic_cell> diff;
if (!result.empty()) {
diff = result.build(std::max(a.timestamp(), b.timestamp()));
} else if (a.timestamp() > b.timestamp()) {
diff = atomic_cell::make_live(a.timestamp(), bytes_view());
}
return diff;
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
// FIXME: allow current_state to be frozen_mutation
auto transform_new_row_to_shards = [clock_offset] (auto& cells) {
cells.for_each_cell([clock_offset] (auto, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
auto delta = value_cast<int64_t>(long_type->deserialize_value(acv.value()));
counter_cell_builder ccb;
ccb.add_shard(counter_shard(counter_id::local(), delta, clock_offset + 1));
ac_o_c = ccb.build(acv.timestamp());
});
};
if (!current_state) {
transform_new_row_to_shards(m.partition().static_row());
for (auto& cr : m.partition().clustered_rows()) {
transform_new_row_to_shards(cr.row().cells());
}
return;
}
clustering_key::less_compare cmp(*m.schema());
auto transform_row_to_shards = [clock_offset] (auto& transformee, auto& state) {
struct counter_shard_or_tombstone {
stdx::optional<counter_shard> shard;
tombstone tomb;
};
std::deque<std::pair<column_id, counter_shard_or_tombstone>> shards;
state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
counter_shard_or_tombstone cs_o_t { { },
tombstone(acv.timestamp(), acv.deletion_time()) };
shards.emplace_back(std::make_pair(id, cs_o_t));
return; // continue -- we are in lambda
}
counter_cell_view ccv(acv);
auto cs = ccv.local_shard();
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard_or_tombstone { counter_shard(*cs), tombstone() }));
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
while (!shards.empty() && shards.front().first < id) {
shards.pop_front();
}
auto delta = value_cast<int64_t>(long_type->deserialize_value(acv.value()));
counter_cell_builder ccb;
if (shards.empty() || shards.front().first > id) {
ccb.add_shard(counter_shard(counter_id::local(), delta, clock_offset + 1));
} else if (shards.front().second.tomb.timestamp == api::missing_timestamp) {
auto& cs = *shards.front().second.shard;
cs.update(delta, clock_offset + 1);
ccb.add_shard(cs);
shards.pop_front();
} else {
// We are applying the tombstone that is already there a second time.
// This is not necessary, but there is no easy way to remove a cell
// from a mutation.
tombstone t = shards.front().second.tomb;
ac_o_c = atomic_cell::make_dead(t.timestamp, t.deletion_time);
shards.pop_front();
return; // continue -- we are in lambda
}
ac_o_c = ccb.build(acv.timestamp());
});
};
transform_row_to_shards(m.partition().static_row(), current_state->partition().static_row());
auto& cstate = current_state->partition();
auto it = cstate.clustered_rows().begin();
auto end = cstate.clustered_rows().end();
for (auto& cr : m.partition().clustered_rows()) {
while (it != end && cmp(it->key(), cr.key())) {
++it;
}
if (it == end || cmp(cr.key(), it->key())) {
transform_new_row_to_shards(cr.row().cells());
continue;
}
transform_row_to_shards(cr.row().cells(), it->row().cells());
}
}
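As a reading aid for the counters.cc changes above, here is a minimal, self-contained sketch of the per-shard rules that apply_reversibly() and difference() appear to implement. It is not the Scylla code. The names `shard`, `merge_shards` and `shard_difference` are hypothetical, a plain int stands in for counter_id, and combine() from combine.hh is assumed to merge two id-sorted ranges and call the supplied lambda for entries with equal ids.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for counter_shard; an int replaces counter_id.
struct shard {
    int id;
    int64_t value;
    int64_t logical_clock;
};

// Merge two id-sorted shard lists. When both lists hold a shard with the
// same id, keep the one with the higher logical clock (a tie keeps `a`),
// which mirrors the lambda passed to combine() in apply_reversibly().
std::vector<shard> merge_shards(const std::vector<shard>& a, const std::vector<shard>& b) {
    std::vector<shard> out;
    out.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].id < b[j].id) {
            out.push_back(a[i++]);
        } else if (b[j].id < a[i].id) {
            out.push_back(b[j++]);
        } else {
            out.push_back(a[i].logical_clock < b[j].logical_clock ? b[j] : a[i]);
            ++i;
            ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}

// Shards of `a` that `b` either lacks or holds at an older logical clock,
// following the loop in counter_cell_view::difference().
std::vector<shard> shard_difference(const std::vector<shard>& a, const std::vector<shard>& b) {
    std::vector<shard> out;
    std::size_t j = 0;
    for (const auto& s : a) {
        while (j < b.size() && b[j].id < s.id) {
            ++j;
        }
        if (j == b.size() || b[j].id != s.id || s.logical_clock > b[j].logical_clock) {
            out.push_back(s);
        }
    }
    return out;
}
```

Both helpers assume their inputs are already sorted by id, which is the invariant that do_sort_and_remove_duplicates() restores in the builder.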

385
counters.hh Normal file

@@ -0,0 +1,385 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <boost/range/algorithm/find_if.hpp>
#include "atomic_cell_or_collection.hh"
#include "types.hh"
#include "stdx.hh"
class mutation;
class mutation;
class counter_id {
int64_t _least_significant;
int64_t _most_significant;
public:
static_assert(std::is_same<decltype(std::declval<utils::UUID>().get_least_significant_bits()), int64_t>::value
&& std::is_same<decltype(std::declval<utils::UUID>().get_most_significant_bits()), int64_t>::value,
"utils::UUID is expected to work with two signed 64-bit integers");
counter_id() = default;
explicit counter_id(utils::UUID uuid) noexcept
: _least_significant(uuid.get_least_significant_bits())
, _most_significant(uuid.get_most_significant_bits())
{ }
utils::UUID to_uuid() const {
return utils::UUID(_most_significant, _least_significant);
}
bool operator<(const counter_id& other) const {
return to_uuid() < other.to_uuid();
}
bool operator>(const counter_id& other) const {
return other.to_uuid() < to_uuid();
}
bool operator==(const counter_id& other) const {
return to_uuid() == other.to_uuid();
}
bool operator!=(const counter_id& other) const {
return !(*this == other);
}
public:
// (Wrong) Counter ID ordering used by Scylla 1.7.4 and earlier.
struct less_compare_1_7_4 {
bool operator()(const counter_id& a, const counter_id& b) const;
};
public:
static counter_id local();
// For tests.
static counter_id generate_random() {
return counter_id(utils::make_random_uuid());
}
};
static_assert(std::is_pod<counter_id>::value, "counter_id should be a POD type");
std::ostream& operator<<(std::ostream& os, const counter_id& id);
class counter_shard_view {
enum class offset : unsigned {
id = 0u,
value = unsigned(id) + sizeof(counter_id),
logical_clock = unsigned(value) + sizeof(int64_t),
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
bytes_view::const_pointer _base;
private:
template<typename T>
T read(offset off) const {
T value;
std::copy_n(_base + static_cast<unsigned>(off), sizeof(T), reinterpret_cast<char*>(&value));
return value;
}
public:
static constexpr auto size = size_t(offset::total_size);
public:
counter_shard_view() = default;
explicit counter_shard_view(bytes_view::const_pointer ptr) noexcept
: _base(ptr) { }
counter_id id() const { return read<counter_id>(offset::id); }
int64_t value() const { return read<int64_t>(offset::value); }
int64_t logical_clock() const { return read<int64_t>(offset::logical_clock); }
bool operator==(const counter_shard_view& other) const {
return id() == other.id() && value() == other.value()
&& logical_clock() == other.logical_clock();
}
bool operator!=(const counter_shard_view& other) const {
return !(*this == other);
}
struct less_compare_by_id {
bool operator()(const counter_shard_view& x, const counter_shard_view& y) const {
return x.id() < y.id();
}
};
};
std::ostream& operator<<(std::ostream& os, counter_shard_view csv);
class counter_shard {
counter_id _id;
int64_t _value;
int64_t _logical_clock;
private:
template<typename T>
static void write(const T& value, bytes::iterator& out) {
out = std::copy_n(reinterpret_cast<const char*>(&value), sizeof(T), out);
}
private:
// Shared logic for applying counter_shards and counter_shard_views.
// T is either counter_shard or counter_shard_view.
template<typename T>
counter_shard& do_apply(T&& other) noexcept {
auto other_clock = other.logical_clock();
if (_logical_clock < other_clock) {
_logical_clock = other_clock;
_value = other.value();
}
return *this;
}
public:
counter_shard(counter_id id, int64_t value, int64_t logical_clock) noexcept
: _id(id)
, _value(value)
, _logical_clock(logical_clock)
{ }
explicit counter_shard(counter_shard_view csv) noexcept
: _id(csv.id())
, _value(csv.value())
, _logical_clock(csv.logical_clock())
{ }
counter_id id() const { return _id; }
int64_t value() const { return _value; }
int64_t logical_clock() const { return _logical_clock; }
counter_shard& update(int64_t value_delta, int64_t clock_increment) noexcept {
_value += value_delta;
_logical_clock += clock_increment;
return *this;
}
counter_shard& apply(counter_shard_view other) noexcept {
return do_apply(other);
}
counter_shard& apply(const counter_shard& other) noexcept {
return do_apply(other);
}
static size_t serialized_size() {
return counter_shard_view::size;
}
void serialize(bytes::iterator& out) const {
write(_id, out);
write(_value, out);
write(_logical_clock, out);
}
};
class counter_cell_builder {
std::vector<counter_shard> _shards;
bool _sorted = true;
private:
void do_sort_and_remove_duplicates();
public:
counter_cell_builder() = default;
counter_cell_builder(size_t shard_count) {
_shards.reserve(shard_count);
}
void add_shard(const counter_shard& cs) {
_shards.emplace_back(cs);
}
void add_maybe_unsorted_shard(const counter_shard& cs) {
add_shard(cs);
if (_sorted && _shards.size() > 1) {
auto current = _shards.rbegin();
auto previous = std::next(current);
_sorted = current->id() > previous->id();
}
}
void sort_and_remove_duplicates() {
if (!_sorted) {
do_sort_and_remove_duplicates();
}
}
size_t serialized_size() const {
return _shards.size() * counter_shard::serialized_size();
}
void serialize(bytes::iterator& out) const {
for (auto&& cs : _shards) {
cs.serialize(out);
}
}
bool empty() const {
return _shards.empty();
}
atomic_cell build(api::timestamp_type timestamp) const {
bytes b(bytes::initialized_later(), serialized_size());
auto out = b.begin();
serialize(out);
return atomic_cell::make_live(timestamp, b);
}
class inserter_iterator : public std::iterator<std::output_iterator_tag, counter_shard> {
counter_cell_builder* _builder;
public:
explicit inserter_iterator(counter_cell_builder& b) : _builder(&b) { }
inserter_iterator& operator=(const counter_shard& cs) {
_builder->add_shard(cs);
return *this;
}
inserter_iterator& operator=(const counter_shard_view& csv) {
return operator=(counter_shard(csv));
}
inserter_iterator& operator++() { return *this; }
inserter_iterator& operator++(int) { return *this; }
inserter_iterator& operator*() { return *this; };
};
inserter_iterator inserter() {
return inserter_iterator(*this);
}
};
// <counter_id> := <int64_t><int64_t>
// <shard> := <counter_id><int64_t:value><int64_t:logical_clock>
// <counter_cell> := <shard>*
class counter_cell_view {
atomic_cell_view _cell;
private:
class shard_iterator : public std::iterator<std::input_iterator_tag, const counter_shard_view> {
bytes_view::const_pointer _current;
counter_shard_view _current_view;
public:
shard_iterator() = default;
shard_iterator(bytes_view::const_pointer ptr) noexcept
: _current(ptr), _current_view(ptr) { }
const counter_shard_view& operator*() const noexcept {
return _current_view;
}
const counter_shard_view* operator->() const noexcept {
return &_current_view;
}
shard_iterator& operator++() noexcept {
_current += counter_shard_view::size;
_current_view = counter_shard_view(_current);
return *this;
}
shard_iterator operator++(int) noexcept {
auto it = *this;
operator++();
return it;
}
bool operator==(const shard_iterator& other) const noexcept {
return _current == other._current;
}
bool operator!=(const shard_iterator& other) const noexcept {
return !(*this == other);
}
};
public:
boost::iterator_range<shard_iterator> shards() const {
auto bv = _cell.value();
auto begin = shard_iterator(bv.data());
auto end = shard_iterator(bv.data() + bv.size());
return boost::make_iterator_range(begin, end);
}
size_t shard_count() const {
return _cell.value().size() / counter_shard_view::size;
}
public:
// ac must be a live counter cell
explicit counter_cell_view(atomic_cell_view ac) noexcept : _cell(ac) {
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
api::timestamp_type timestamp() const { return _cell.timestamp(); }
static data_type total_value_type() { return long_type; }
int64_t total_value() const {
return boost::accumulate(shards(), int64_t(0), [] (int64_t v, counter_shard_view cs) {
return v + cs.value();
});
}
stdx::optional<counter_shard_view> get_shard(const counter_id& id) const {
auto it = boost::range::find_if(shards(), [&id] (counter_shard_view csv) {
return csv.id() == id;
});
if (it == shards().end()) {
return { };
}
return *it;
}
stdx::optional<counter_shard_view> local_shard() const {
// TODO: consider caching local shard position
return get_shard(counter_id::local());
}
bool operator==(const counter_cell_view& other) const {
return timestamp() == other.timestamp() && boost::equal(shards(), other.shards());
}
// Returns counter shards in an order that is compatible with Scylla 1.7.4.
std::vector<counter_shard> shards_compatible_with_1_7_4() const;
// Reversibly applies two counter cells; at least one of them must be live.
// Returns true iff dst was modified.
static bool apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Reverts the apply performed by apply_reversibly().
static void revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Computes a counter cell containing the minimal amount of data which, when
// applied to 'b', returns the same cell as 'a' and 'b' applied together.
static stdx::optional<atomic_cell> difference(atomic_cell_view a, atomic_cell_view b);
friend std::ostream& operator<<(std::ostream& os, counter_cell_view ccv);
};
// Transforms mutation dst from counter updates to counter shards using state
// stored in current_state.
// If current_state is present it has to be in the same schema as dst.
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset);
template<>
struct appending_hash<counter_shard_view> {
template<typename Hasher>
void operator()(Hasher& h, const counter_shard_view& cshard) const {
::feed_hash(h, cshard.id().to_uuid());
::feed_hash(h, cshard.value());
::feed_hash(h, cshard.logical_clock());
}
};
template<>
struct appending_hash<counter_cell_view> {
template<typename Hasher>
void operator()(Hasher& h, const counter_cell_view& cell) const {
::feed_hash(h, true); // is_live
::feed_hash(h, cell.timestamp());
for (auto&& csv : cell.shards()) {
::feed_hash(h, csv);
}
}
};
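The layout comment in counters.hh (`<shard> := <counter_id><int64_t:value><int64_t:logical_clock>`) can be illustrated with a small standalone sketch. It is a simplified model under stated assumptions, not the Scylla implementation: `raw_shard`, `append_shard`, `read_at`, `read_shard`, `shard_count` and `total_value` are hypothetical names, a std::vector<char> stands in for the cell value, the two counter_id words are written in member order (least significant first), and all fields use host byte order as the memcpy-style reads in counter_shard_view suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical mirror of one serialized shard: four 64-bit fields.
struct raw_shard {
    int64_t id_lsb;
    int64_t id_msb;
    int64_t value;
    int64_t logical_clock;
};

// Size of one shard on the wire: 2 id words + value + logical clock.
constexpr std::size_t shard_size = 4 * sizeof(int64_t);

// Append one shard to the buffer by copying its fields byte for byte.
void append_shard(std::vector<char>& buf, const raw_shard& s) {
    const int64_t fields[] = {s.id_lsb, s.id_msb, s.value, s.logical_clock};
    const char* p = reinterpret_cast<const char*>(fields);
    buf.insert(buf.end(), p, p + sizeof(fields));
}

// Read a T at a byte offset without assuming the buffer is aligned.
template <typename T>
T read_at(const char* base, std::size_t offset) {
    T v;
    std::memcpy(&v, base + offset, sizeof(T));
    return v;
}

std::size_t shard_count(const std::vector<char>& buf) {
    return buf.size() / shard_size;
}

raw_shard read_shard(const std::vector<char>& buf, std::size_t index) {
    const char* base = buf.data() + index * shard_size;
    return raw_shard{
        read_at<int64_t>(base, 0 * sizeof(int64_t)),
        read_at<int64_t>(base, 1 * sizeof(int64_t)),
        read_at<int64_t>(base, 2 * sizeof(int64_t)),
        read_at<int64_t>(base, 3 * sizeof(int64_t)),
    };
}

// The counter's visible value is the sum of all shard values.
int64_t total_value(const std::vector<char>& buf) {
    int64_t sum = 0;
    for (std::size_t i = 0; i < shard_count(buf); ++i) {
        sum += read_shard(buf, i).value;
    }
    return sum;
}
```

Copying through memcpy rather than casting the buffer pointer avoids alignment and aliasing problems, which is presumably why the view class reads its fields with std::copy_n.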


@@ -36,15 +36,18 @@ options {
#include "cql3/statements/raw/select_statement.hh"
#include "cql3/statements/alter_keyspace_statement.hh"
#include "cql3/statements/alter_table_statement.hh"
#include "cql3/statements/alter_view_statement.hh"
#include "cql3/statements/create_keyspace_statement.hh"
#include "cql3/statements/drop_keyspace_statement.hh"
#include "cql3/statements/create_index_statement.hh"
#include "cql3/statements/create_table_statement.hh"
#include "cql3/statements/create_view_statement.hh"
#include "cql3/statements/create_type_statement.hh"
#include "cql3/statements/drop_type_statement.hh"
#include "cql3/statements/alter_type_statement.hh"
#include "cql3/statements/property_definitions.hh"
#include "cql3/statements/drop_table_statement.hh"
#include "cql3/statements/drop_view_statement.hh"
#include "cql3/statements/truncate_statement.hh"
#include "cql3/statements/raw/update_statement.hh"
#include "cql3/statements/raw/insert_statement.hh"
@@ -340,6 +343,9 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st30=createAggregateStatement { $stmt = st30; }
| st31=dropAggregateStatement { $stmt = st31; }
#endif
| st32=createViewStatement { $stmt = st32; }
| st33=alterViewStatement { $stmt = st33; }
| st34=dropViewStatement { $stmt = st34; }
;
/*
@@ -716,7 +722,7 @@ createTableStatement returns [shared_ptr<cql3::statements::create_table_statemen
cfamDefinition[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: '(' cfamColumns[expr] ( ',' cfamColumns[expr]? )* ')'
( K_WITH cfamProperty[expr] ( K_AND cfamProperty[expr] )*)?
( K_WITH cfamProperty[$expr->properties()] ( K_AND cfamProperty[$expr->properties()] )*)?
;
cfamColumns[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
@@ -732,15 +738,15 @@ pkDef[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
| '(' k1=ident { l.push_back(k1); } ( ',' kn=ident { l.push_back(kn); } )* ')' { $expr->add_key_aliases(l); }
;
cfamProperty[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: property[expr->properties]
| K_COMPACT K_STORAGE { $expr->set_compact_storage(); }
cfamProperty[cql3::statements::cf_properties& expr]
: property[$expr.properties()]
| K_COMPACT K_STORAGE { $expr.set_compact_storage(); }
| K_CLUSTERING K_ORDER K_BY '(' cfamOrdering[expr] (',' cfamOrdering[expr])* ')'
;
cfamOrdering[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
cfamOrdering[cql3::statements::cf_properties& expr]
@init{ bool reversed=false; }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr->set_ordering(k, reversed); }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr.set_ordering(k, reversed); }
;
@@ -787,6 +793,39 @@ indexIdent returns [::shared_ptr<index_target::raw> id]
| K_FULL '(' c=cident ')' { $id = index_target::raw::full_collection(c); }
;
/**
* CREATE MATERIALIZED VIEW <viewName> AS
* SELECT <columns>
* FROM <CF>
* WHERE <pkColumns> IS NOT NULL
* PRIMARY KEY (<pkColumns>)
* WITH <property> = <value> AND ...;
*/
createViewStatement returns [::shared_ptr<create_view_statement> expr]
@init {
bool if_not_exists = false;
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> composite_keys;
}
: K_CREATE K_MATERIALIZED K_VIEW (K_IF K_NOT K_EXISTS { if_not_exists = true; })? cf=columnFamilyName K_AS
K_SELECT sclause=selectClause K_FROM basecf=columnFamilyName
(K_WHERE wclause=whereClause)?
K_PRIMARY K_KEY (
'(' '(' k1=cident { partition_keys.push_back(k1); } ( ',' kn=cident { partition_keys.push_back(kn); } )* ')' ( ',' c1=cident { composite_keys.push_back(c1); } )* ')'
| '(' k1=cident { partition_keys.push_back(k1); } ( ',' cn=cident { composite_keys.push_back(cn); } )* ')'
)
{
$expr = ::make_shared<create_view_statement>(
std::move(cf),
std::move(basecf),
std::move(sclause),
std::move(wclause),
std::move(partition_keys),
std::move(composite_keys),
if_not_exists);
}
( K_WITH cfamProperty[{ $expr->properties() }] ( K_AND cfamProperty[{ $expr->properties() }] )*)?
;
#if 0
/**
@@ -833,7 +872,7 @@ alterKeyspaceStatement returns [shared_ptr<cql3::statements::alter_keyspace_stat
alterTableStatement returns [shared_ptr<alter_table_statement> expr]
@init {
alter_table_statement::type type;
auto props = make_shared<cql3::statements::cf_prop_defs>();;
auto props = make_shared<cql3::statements::cf_prop_defs>();
std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>, shared_ptr<cql3::column_identifier::raw>>> renames;
bool is_static = false;
}
@@ -867,6 +906,18 @@ alterTypeStatement returns [::shared_ptr<alter_type_statement> expr]
)
;
/**
* ALTER MATERIALIZED VIEW <CF> WITH <property> = <value>;
*/
alterViewStatement returns [::shared_ptr<alter_view_statement> expr]
@init {
auto props = make_shared<cql3::statements::cf_prop_defs>();
}
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[props]
{
$expr = ::make_shared<alter_view_statement>(std::move(cf), std::move(props));
}
;
renames[::shared_ptr<alter_type_statement::renames> expr]
: fromId=ident K_TO toId=ident { $expr->add_rename(fromId, toId); }
@@ -897,6 +948,15 @@ dropTypeStatement returns [::shared_ptr<drop_type_statement> stmt]
: K_DROP K_TYPE (K_IF K_EXISTS { if_exists = true; } )? name=userTypeName { $stmt = ::make_shared<drop_type_statement>(name, if_exists); }
;
/**
* DROP MATERIALIZED VIEW [IF EXISTS] <view_name>
*/
dropViewStatement returns [::shared_ptr<drop_view_statement> stmt]
@init { bool if_exists = false; }
: K_DROP K_MATERIALIZED K_VIEW (K_IF K_EXISTS { if_exists = true; } )? cf=columnFamilyName
{ $stmt = ::make_shared<drop_view_statement>(cf, if_exists); }
;
#if 0
/**
* DROP INDEX [IF EXISTS] <INDEX_NAME>
@@ -1304,7 +1364,8 @@ relation[std::vector<cql3::relation_ptr>& clauses]
| K_TOKEN l=tupleOfIdentifiers type=relationType t=term
{ $clauses.emplace_back(::make_shared<cql3::token_relation>(std::move(l), *type, std::move(t))); }
| name=cident K_IS K_NOT K_NULL {
$clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IS_NOT, cql3::constants::NULL_LITERAL)); }
| name=cident K_IN marker=inMarker
{ $clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IN, std::move(marker))); }
| name=cident K_IN in_values=singleColumnInValues
@@ -1404,12 +1465,16 @@ native_type returns [shared_ptr<cql3_type> t]
| K_FLOAT { $t = cql3_type::float_; }
| K_INET { $t = cql3_type::inet; }
| K_INT { $t = cql3_type::int_; }
| K_SMALLINT { $t = cql3_type::smallint; }
| K_TEXT { $t = cql3_type::text; }
| K_TIMESTAMP { $t = cql3_type::timestamp; }
| K_TINYINT { $t = cql3_type::tinyint; }
| K_UUID { $t = cql3_type::uuid; }
| K_VARCHAR { $t = cql3_type::varchar; }
| K_VARINT { $t = cql3_type::varint; }
| K_TIMEUUID { $t = cql3_type::timeuuid; }
| K_DATE { $t = cql3_type::date; }
| K_TIME { $t = cql3_type::time; }
;
collection_type returns [shared_ptr<cql3::cql3_type::raw> pt]
@@ -1483,6 +1548,8 @@ basic_unreserved_keyword returns [sstring str]
| K_DISTINCT
| K_CONTAINS
| K_STATIC
| K_FROZEN
| K_TUPLE
| K_FUNCTION
| K_AGGREGATE
| K_SFUNC
@@ -1528,6 +1595,8 @@ K_KEYSPACE: ( K E Y S P A C E
K_KEYSPACES: K E Y S P A C E S;
K_COLUMNFAMILY:( C O L U M N F A M I L Y
| T A B L E );
K_MATERIALIZED:M A T E R I A L I Z E D;
K_VIEW: V I E W;
K_INDEX: I N D E X;
K_CUSTOM: C U S T O M;
K_ON: O N;
@@ -1551,6 +1620,7 @@ K_DESC: D E S C;
K_ALLOW: A L L O W;
K_FILTERING: F I L T E R I N G;
K_IF: I F;
K_IS: I S;
K_CONTAINS: C O N T A I N S;
K_GRANT: G R A N T;
@@ -1580,6 +1650,8 @@ K_DOUBLE: D O U B L E;
K_FLOAT: F L O A T;
K_INET: I N E T;
K_INT: I N T;
K_SMALLINT: S M A L L I N T;
K_TINYINT: T I N Y I N T;
K_TEXT: T E X T;
K_UUID: U U I D;
K_VARCHAR: V A R C H A R;
@@ -1587,6 +1659,8 @@ K_VARINT: V A R I N T;
K_TIMEUUID: T I M E U U I D;
K_TOKEN: T O K E N;
K_WRITETIME: W R I T E T I M E;
K_DATE: D A T E;
K_TIME: T I M E;
K_NULL: N U L L;
K_NOT: N O T;

@@ -71,10 +71,12 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
}
auto tval = _timestamp->bind_and_get(options);
if (!tval) {
if (tval.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value of timestamp");
}
if (tval.is_unset_value()) {
return now;
}
try {
data_type_for<int64_t>()->validate(*tval);
} catch (marshal_exception e) {
@@ -88,10 +90,12 @@ int32_t attributes::get_time_to_live(const query_options& options) {
return 0;
auto tval = _time_to_live->bind_and_get(options);
if (!tval) {
if (tval.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value of TTL");
}
if (tval.is_unset_value()) {
return 0;
}
try {
data_type_for<int32_t>()->validate(*tval);
}
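The two hunks above replace a single `if (!tval)` test with separate `is_null()` and `is_unset_value()` checks, so a bound value now has three states rather than two. A minimal sketch of that pattern follows, assuming a simplified stand-in type: `bound_value` and `time_to_live_or_default` are hypothetical names used only for illustration, not Scylla's `cql3::raw_value` API.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a bound CQL value with three states.
// This is an illustration only, not Scylla's cql3::raw_value.
class bound_value {
public:
    enum class state { null, unset, value };

    static bound_value make_null() { return bound_value(state::null, 0); }
    static bound_value make_unset() { return bound_value(state::unset, 0); }
    static bound_value make_value(int32_t v) { return bound_value(state::value, v); }

    bool is_null() const { return _state == state::null; }
    bool is_unset_value() const { return _state == state::unset; }
    bool is_value() const { return _state == state::value; }

    // Only meaningful when is_value() is true.
    int32_t operator*() const { return _payload; }

private:
    bound_value(state s, int32_t p) : _state(s), _payload(p) {}
    state _state;
    int32_t _payload;
};

// Follows the TTL handling in the hunk above: null is rejected,
// unset falls back to the default of 0, a real value is used as-is.
inline int32_t time_to_live_or_default(const bound_value& v) {
    if (v.is_null()) {
        throw std::invalid_argument("Invalid null value of TTL");
    }
    if (v.is_unset_value()) {
        return 0;
    }
    return *v;
}
```

With a two-state optional, "not provided" and "explicitly null" are indistinguishable. Keeping the third state lets a caller treat an unset bind marker as "leave the default alone" while still rejecting null.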

@@ -47,7 +47,7 @@
#include <algorithm>
#include <functional>
#include <iostream>
#include <iosfwd>
namespace cql3 {

@@ -44,6 +44,7 @@
namespace cql3 {
thread_local const ::shared_ptr<constants::value> constants::UNSET_VALUE = ::make_shared<constants::value>(cql3::raw_value::make_unset_value());
thread_local const ::shared_ptr<term::raw> constants::NULL_LITERAL = ::make_shared<constants::null_literal>();
thread_local const ::shared_ptr<terminal> constants::null_literal::NULL_VALUE = ::make_shared<constants::null_literal::null_value>();
@@ -97,7 +98,9 @@ constants::literal::test_assignment(database& db, const sstring& keyspace, ::sha
cql3_type::kind::TEXT,
cql3_type::kind::INET,
cql3_type::kind::VARCHAR,
cql3_type::kind::TIMESTAMP>::contains(kind)) {
cql3_type::kind::TIMESTAMP,
cql3_type::kind::DATE,
cql3_type::kind::TIME>::contains(kind)) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
break;
@@ -109,7 +112,10 @@ constants::literal::test_assignment(database& db, const sstring& keyspace, ::sha
cql3_type::kind::DOUBLE,
cql3_type::kind::FLOAT,
cql3_type::kind::INT,
cql3_type::kind::SMALLINT,
cql3_type::kind::TIMESTAMP,
cql3_type::kind::DATE,
cql3_type::kind::TINYINT,
cql3_type::kind::VARINT>::contains(kind)) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
@@ -150,7 +156,7 @@ constants::literal::prepare(database& db, const sstring& keyspace, ::shared_ptr<
throw exceptions::invalid_request_exception(sprint("Invalid %s constant (%s) for \"%s\" of type %s",
_type, _text, *receiver->name, receiver->type->as_cql3_type()->to_string()));
}
return ::make_shared<value>(std::experimental::make_optional(parsed_value(receiver->type)));
return ::make_shared<value>(cql3::raw_value::make_value(parsed_value(receiver->type)));
}
void constants::deleter::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {

@@ -44,6 +44,7 @@
#include "cql3/abstract_marker.hh"
#include "cql3/update_parameters.hh"
#include "cql3/operation.hh"
#include "cql3/values.hh"
#include "cql3/term.hh"
#include "core/shared_ptr.hh"
@@ -67,18 +68,20 @@ public:
*/
class value : public terminal {
public:
bytes_opt _bytes;
value(bytes_opt bytes_) : _bytes(std::move(bytes_)) {}
virtual bytes_opt get(const query_options& options) override { return _bytes; }
virtual bytes_view_opt bind_and_get(const query_options& options) override { return as_bytes_view_opt(_bytes); }
cql3::raw_value _bytes;
value(cql3::raw_value bytes_) : _bytes(std::move(bytes_)) {}
virtual cql3::raw_value get(const query_options& options) override { return _bytes; }
virtual cql3::raw_value_view bind_and_get(const query_options& options) override { return _bytes.to_view(); }
virtual sstring to_string() const override { return to_hex(*_bytes); }
};
static thread_local const ::shared_ptr<value> UNSET_VALUE;
class null_literal final : public term::raw {
private:
class null_value final : public value {
public:
null_value() : value({}) {}
null_value() : value(cql3::raw_value::make_null()) {}
virtual ::shared_ptr<terminal> bind(const query_options& options) override { return {}; }
virtual sstring to_string() const override { return "null"; }
};
@@ -169,14 +172,13 @@ public:
assert(!_receiver->type->is_collection());
}
virtual bytes_view_opt bind_and_get(const query_options& options) override {
virtual cql3::raw_value_view bind_and_get(const query_options& options) override {
try {
auto value = options.get_value_at(_bind_index);
if (value) {
_receiver->type->validate(*value);
return *value;
}
return std::experimental::nullopt;
return value;
} catch (const marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -187,7 +189,7 @@ public:
if (!bytes) {
return ::shared_ptr<terminal>{};
}
return ::make_shared<constants::value>(std::move(to_bytes_opt(*bytes)));
return ::make_shared<constants::value>(std::move(cql3::raw_value::make_value(to_bytes(*bytes))));
}
};
@@ -197,52 +199,46 @@ public:
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
auto cell = value ? make_cell(*value, params) : make_dead_cell(params);
m.set_cell(prefix, column, std::move(cell));
if (value.is_null()) {
m.set_cell(prefix, column, std::move(make_dead_cell(params)));
} else if (value.is_value()) {
m.set_cell(prefix, column, std::move(make_cell(*value, params)));
}
}
};
#if 0
public static class Adder extends Operation
{
public Adder(ColumnDefinition column, Term t)
{
super(column, t);
struct adder final : operation {
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
m.set_cell(prefix, column, make_counter_update_cell(increment, params));
}
};
public void execute(ByteBuffer rowKey, ColumnFamily cf, Composite prefix, UpdateParameters params) throws InvalidRequestException
{
ByteBuffer bytes = t.bindAndGet(params.options);
if (bytes == null)
throw new InvalidRequestException("Invalid null value for counter increment");
long increment = ByteBufferUtil.toLong(bytes);
CellName cname = cf.getComparator().create(prefix, column);
cf.addColumn(params.makeCounter(cname, increment));
struct subtracter final : operation {
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
if (increment == std::numeric_limits<int64_t>::min()) {
throw exceptions::invalid_request_exception(sprint("The negation of %d overflows supported counter precision (signed 8 bytes integer)", increment));
}
m.set_cell(prefix, column, make_counter_update_cell(-increment, params));
}
}
public static class Substracter extends Operation
{
public Substracter(ColumnDefinition column, Term t)
{
super(column, t);
}
public void execute(ByteBuffer rowKey, ColumnFamily cf, Composite prefix, UpdateParameters params) throws InvalidRequestException
{
ByteBuffer bytes = t.bindAndGet(params.options);
if (bytes == null)
throw new InvalidRequestException("Invalid null value for counter increment");
long increment = ByteBufferUtil.toLong(bytes);
if (increment == Long.MIN_VALUE)
throw new InvalidRequestException("The negation of " + increment + " overflows supported counter precision (signed 8 bytes integer)");
CellName cname = cf.getComparator().create(prefix, column);
cf.addColumn(params.makeCounter(cname, -increment));
}
}
#endif
};
class deleter : public operation {
public:

@@ -273,11 +273,15 @@ thread_local shared_ptr<cql3_type> cql3_type::boolean = make("boolean", boolean_
thread_local shared_ptr<cql3_type> cql3_type::double_ = make("double", double_type, cql3_type::kind::DOUBLE);
thread_local shared_ptr<cql3_type> cql3_type::float_ = make("float", float_type, cql3_type::kind::FLOAT);
thread_local shared_ptr<cql3_type> cql3_type::int_ = make("int", int32_type, cql3_type::kind::INT);
thread_local shared_ptr<cql3_type> cql3_type::smallint = make("smallint", short_type, cql3_type::kind::SMALLINT);
thread_local shared_ptr<cql3_type> cql3_type::text = make("text", utf8_type, cql3_type::kind::TEXT);
thread_local shared_ptr<cql3_type> cql3_type::timestamp = make("timestamp", timestamp_type, cql3_type::kind::TIMESTAMP);
thread_local shared_ptr<cql3_type> cql3_type::tinyint = make("tinyint", byte_type, cql3_type::kind::TINYINT);
thread_local shared_ptr<cql3_type> cql3_type::uuid = make("uuid", uuid_type, cql3_type::kind::UUID);
thread_local shared_ptr<cql3_type> cql3_type::varchar = make("varchar", utf8_type, cql3_type::kind::TEXT);
thread_local shared_ptr<cql3_type> cql3_type::timeuuid = make("timeuuid", timeuuid_type, cql3_type::kind::TIMEUUID);
thread_local shared_ptr<cql3_type> cql3_type::date = make("date", simple_date_type, cql3_type::kind::DATE);
thread_local shared_ptr<cql3_type> cql3_type::time = make("time", time_type, cql3_type::kind::TIME);
thread_local shared_ptr<cql3_type> cql3_type::inet = make("inet", inet_addr_type, cql3_type::kind::INET);
thread_local shared_ptr<cql3_type> cql3_type::varint = make("varint", varint_type, cql3_type::kind::VARINT);
thread_local shared_ptr<cql3_type> cql3_type::decimal = make("decimal", decimal_type, cql3_type::kind::DECIMAL);
@@ -296,12 +300,16 @@ cql3_type::values() {
cql3_type::float_,
cql3_type::inet,
cql3_type::int_,
cql3_type::smallint,
cql3_type::text,
cql3_type::timestamp,
cql3_type::tinyint,
cql3_type::uuid,
cql3_type::varchar,
cql3_type::varint,
cql3_type::timeuuid,
cql3_type::date,
cql3_type::time,
};
return v;
}

@@ -98,7 +98,7 @@ private:
public:
enum class kind : int8_t {
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, FLOAT, INT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, TINYINT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID, DATE, TIME
};
using kind_enum = super_enum<kind,
kind::ASCII,
@@ -111,12 +111,16 @@ public:
kind::FLOAT,
kind::INET,
kind::INT,
kind::SMALLINT,
kind::TINYINT,
kind::TEXT,
kind::TIMESTAMP,
kind::UUID,
kind::VARCHAR,
kind::VARINT,
kind::TIMEUUID>;
kind::TIMEUUID,
kind::DATE,
kind::TIME>;
using kind_enum_set = enum_set<kind_enum>;
private:
std::experimental::optional<kind_enum_set::prepared> _kind;
@@ -131,11 +135,15 @@ public:
static thread_local shared_ptr<cql3_type> double_;
static thread_local shared_ptr<cql3_type> float_;
static thread_local shared_ptr<cql3_type> int_;
static thread_local shared_ptr<cql3_type> smallint;
static thread_local shared_ptr<cql3_type> text;
static thread_local shared_ptr<cql3_type> timestamp;
static thread_local shared_ptr<cql3_type> tinyint;
static thread_local shared_ptr<cql3_type> uuid;
static thread_local shared_ptr<cql3_type> varchar;
static thread_local shared_ptr<cql3_type> timeuuid;
static thread_local shared_ptr<cql3_type> date;
static thread_local shared_ptr<cql3_type> time;
static thread_local shared_ptr<cql3_type> inet;
static thread_local shared_ptr<cql3_type> varint;
static thread_local shared_ptr<cql3_type> decimal;

@@ -43,7 +43,7 @@
#include "types.hh"
#include <vector>
#include <iostream>
#include <iosfwd>
#include <boost/functional/hash.hpp>
namespace cql3 {

@@ -59,13 +59,13 @@ public:
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
virtual void collect_marker_specification(shared_ptr<variable_specifications> bound_names) override;
virtual shared_ptr<terminal> bind(const query_options& options) override;
virtual bytes_view_opt bind_and_get(const query_options& options) override;
virtual cql3::raw_value_view bind_and_get(const query_options& options) override;
private:
static bytes_opt execute_internal(cql_serialization_format sf, scalar_function& fun, std::vector<bytes_opt> params);
public:
virtual bool contains_bind_marker() const override;
private:
static shared_ptr<terminal> make_terminal(shared_ptr<function> fun, bytes_opt result, cql_serialization_format sf);
static shared_ptr<terminal> make_terminal(shared_ptr<function> fun, cql3::raw_value result, cql_serialization_format sf);
public:
class raw : public term::raw {
function_name _name;

@@ -43,7 +43,7 @@
#include "core/sstring.hh"
#include "db/system_keyspace.hh"
#include <iostream>
#include <iosfwd>
#include <functional>
namespace cql3 {

@@ -67,6 +67,14 @@ functions::init() {
declare(aggregate_fcts::make_max_function<int64_t>());
declare(aggregate_fcts::make_min_function<int64_t>());
declare(aggregate_fcts::make_count_function<float>());
declare(aggregate_fcts::make_max_function<float>());
declare(aggregate_fcts::make_min_function<float>());
declare(aggregate_fcts::make_count_function<double>());
declare(aggregate_fcts::make_max_function<double>());
declare(aggregate_fcts::make_min_function<double>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());
@@ -78,15 +86,17 @@ functions::init() {
declare(make_blob_as_varchar_fct());
declare(aggregate_fcts::make_sum_function<int32_t>());
declare(aggregate_fcts::make_sum_function<int64_t>());
declare(aggregate_fcts::make_avg_function<int32_t>());
declare(aggregate_fcts::make_avg_function<int64_t>());
declare(aggregate_fcts::make_sum_function<float>());
declare(aggregate_fcts::make_sum_function<double>());
#if 0
declare(AggregateFcts.sumFunctionForFloat);
declare(AggregateFcts.sumFunctionForDouble);
declare(AggregateFcts.sumFunctionForDecimal);
declare(AggregateFcts.sumFunctionForVarint);
declare(AggregateFcts.avgFunctionForFloat);
declare(AggregateFcts.avgFunctionForDouble);
#endif
declare(aggregate_fcts::make_avg_function<int32_t>());
declare(aggregate_fcts::make_avg_function<int64_t>());
declare(aggregate_fcts::make_avg_function<float>());
declare(aggregate_fcts::make_avg_function<double>());
#if 0
declare(AggregateFcts.avgFunctionForVarint);
declare(AggregateFcts.avgFunctionForDecimal);
#endif
@@ -299,10 +309,10 @@ function_call::collect_marker_specification(shared_ptr<variable_specifications>
shared_ptr<terminal>
function_call::bind(const query_options& options) {
return make_terminal(_fun, to_bytes_opt(bind_and_get(options)), options.get_cql_serialization_format());
return make_terminal(_fun, cql3::raw_value::make_value(bind_and_get(options)), options.get_cql_serialization_format());
}
bytes_view_opt
cql3::raw_value_view
function_call::bind_and_get(const query_options& options) {
std::vector<bytes_opt> buffers;
buffers.reserve(_terms.size());
@@ -316,7 +326,7 @@ function_call::bind_and_get(const query_options& options) {
buffers.push_back(std::move(to_bytes_opt(val)));
}
auto result = execute_internal(options.get_cql_serialization_format(), *_fun, std::move(buffers));
return options.make_temporary(result);
return options.make_temporary(cql3::raw_value::make_value(result));
}
bytes_opt
@@ -347,7 +357,7 @@ function_call::contains_bind_marker() const {
}
shared_ptr<terminal>
function_call::make_terminal(shared_ptr<function> fun, bytes_opt result, cql_serialization_format sf) {
function_call::make_terminal(shared_ptr<function> fun, cql3::raw_value result, cql_serialization_format sf) {
if (!dynamic_pointer_cast<const collection_type_impl>(fun->return_type())) {
return ::make_shared<constants::value>(std::move(result));
}
@@ -413,7 +423,7 @@ function_call::raw::prepare(database& db, const sstring& keyspace, ::shared_ptr<
// If all parameters are terminal and the function is pure, we can
// evaluate it now; otherwise we'd have to wait until execution time
if (all_terminal && scalar_fun->is_pure()) {
return make_terminal(scalar_fun, execute(*scalar_fun, parameters), query_options::DEFAULT.get_cql_serialization_format());
return make_terminal(scalar_fun, cql3::raw_value::make_value(execute(*scalar_fun, parameters)), query_options::DEFAULT.get_cql_serialization_format());
} else {
return ::make_shared<function_call>(scalar_fun, parameters);
}
@@ -426,7 +436,7 @@ function_call::raw::execute(scalar_function& fun, std::vector<shared_ptr<term>>
for (auto&& t : parameters) {
assert(dynamic_cast<terminal*>(t.get()));
auto&& param = static_cast<terminal*>(t.get())->get(query_options::DEFAULT);
buffers.push_back(std::move(param));
buffers.push_back(std::move(to_bytes_opt(param)));
}
return execute_internal(cql_serialization_format::internal(), fun, buffers);

@@ -133,9 +133,9 @@ lists::value::from_serialized(bytes_view v, list_type type, cql_serialization_fo
}
}
bytes_opt
cql3::raw_value
lists::value::get(const query_options& options) {
return get_with_protocol_version(options.get_cql_serialization_format());
return cql3::raw_value::make_value(get_with_protocol_version(options.get_cql_serialization_format()));
}
bytes
@@ -196,10 +196,12 @@ lists::delayed_value::bind(const query_options& options) {
for (auto&& t : _elements) {
auto bo = t->bind_and_get(options);
if (!bo) {
if (bo.is_null()) {
throw exceptions::invalid_request_exception("null is not supported inside collections");
}
if (bo.is_unset_value()) {
return constants::UNSET_VALUE;
}
// We don't support values > 64K because the serialization format encodes the length as an unsigned short.
if (bo->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("List value is too long. List values are limited to %d bytes but %d bytes value provided",
@@ -216,8 +218,10 @@ lists::delayed_value::bind(const query_options& options) {
lists::marker::bind(const query_options& options) {
const auto& value = options.get_value_at(_bind_index);
auto ltype = static_pointer_cast<const list_type_impl>(_receiver->type);
if (!value) {
if (value.is_null()) {
return nullptr;
} else if (value.is_unset_value()) {
return constants::UNSET_VALUE;
} else {
return make_shared(value::from_serialized(*value, std::move(ltype), options.get_cql_serialization_format()));
}
@@ -239,6 +243,10 @@ lists::precision_time::get_next(db_clock::time_point millis) {
void
lists::setter::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
}
if (column.type->is_multi_cell()) {
// delete + append
collection_type_impl::mutation mut;
@@ -247,7 +255,7 @@ lists::setter::execute(mutation& m, const exploded_clustering_prefix& prefix, co
auto col_mut = ctype->serialize_mutation_form(std::move(mut));
m.set_cell(prefix, column, std::move(col_mut));
}
do_append(_t, m, prefix, column, params);
do_append(value, m, prefix, column, params);
}
bool
@@ -272,11 +280,16 @@ lists::setter_by_index::execute(mutation& m, const exploded_clustering_prefix& p
}
auto index = _idx->bind_and_get(params._options);
auto value = _t->bind_and_get(params._options);
if (!index) {
if (index.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for list index");
}
if (index.is_unset_value()) {
throw exceptions::invalid_request_exception("Invalid unset value for list index");
}
auto value = _t->bind_and_get(params._options);
if (value.is_unset_value()) {
return;
}
auto idx = net::ntoh(int32_t(*unaligned_cast<int32_t>(index->begin())));
auto&& existing_list_opt = params.get_prefetched_list(m.key(), std::move(row_key), column);
@@ -343,23 +356,26 @@ lists::setter_by_uuid::execute(mutation& m, const exploded_clustering_prefix& pr
void
lists::appender::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
}
assert(column.type->is_multi_cell()); // "Attempted to append to a frozen list";
do_append(_t, m, prefix, column, params);
do_append(value, m, prefix, column, params);
}
void
lists::do_append(shared_ptr<term> t,
lists::do_append(shared_ptr<term> value,
mutation& m,
const exploded_clustering_prefix& prefix,
const column_definition& column,
const update_parameters& params) {
auto&& value = t->bind(params._options);
auto&& list_value = dynamic_pointer_cast<lists::value>(value);
auto&& ltype = dynamic_pointer_cast<const list_type_impl>(column.type);
if (column.type->is_multi_cell()) {
// If we append null, do nothing. Note that for Setter, we've
// already removed the previous value so we're good here too
if (!value) {
if (!value || value == constants::UNSET_VALUE) {
return;
}
@@ -388,7 +404,7 @@ void
lists::prepender::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to prepend to a frozen list";
auto&& value = _t->bind(params._options);
if (!value) {
if (!value || value == constants::UNSET_VALUE) {
return;
}
@@ -441,7 +457,7 @@ lists::discarder::execute(mutation& m, const exploded_clustering_prefix& prefix,
return;
}
if (!value) {
if (!value || value == constants::UNSET_VALUE) {
return;
}
@@ -480,6 +496,9 @@ lists::discarder_by_index::execute(mutation& m, const exploded_clustering_prefix
if (!index) {
throw exceptions::invalid_request_exception("Invalid null value for list index");
}
if (index == constants::UNSET_VALUE) {
return;
}
auto ltype = static_pointer_cast<const list_type_impl>(column.type);
auto cvalue = dynamic_pointer_cast<constants::value>(index);

@@ -80,7 +80,7 @@ public:
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, list_type type, cql_serialization_format sf);
virtual bytes_opt get(const query_options& options) override;
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(shared_ptr<list_type_impl> lt, const value& v);
virtual std::vector<bytes_opt> get_elements() override;
@@ -176,7 +176,7 @@ public:
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
};
static void do_append(shared_ptr<term> t,
static void do_append(shared_ptr<term> value,
mutation& m,
const exploded_clustering_prefix& prefix,
const column_definition& column,

@@ -169,9 +169,9 @@ maps::value::from_serialized(bytes_view value, map_type type, cql_serialization_
}
}
bytes_opt
cql3::raw_value
maps::value::get(const query_options& options) {
return get_with_protocol_version(options.get_cql_serialization_format());
return cql3::raw_value::make_value(get_with_protocol_version(options.get_cql_serialization_format()));
}
bytes
@@ -227,18 +227,24 @@ maps::delayed_value::bind(const query_options& options) {
// We don't support values > 64K because the serialization format encodes the length as an unsigned short.
auto key_bytes = key->bind_and_get(options);
if (!key_bytes) {
if (key_bytes.is_null()) {
throw exceptions::invalid_request_exception("null is not supported inside collections");
}
if (key_bytes.is_unset_value()) {
throw exceptions::invalid_request_exception("unset value is not supported inside collections");
}
if (key_bytes->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Map key is too long. Map keys are limited to %d bytes but %d bytes keys provided",
std::numeric_limits<uint16_t>::max(),
key_bytes->size()));
}
auto value_bytes = value->bind_and_get(options);
if (!value_bytes) {
if (value_bytes.is_null()) {
throw exceptions::invalid_request_exception("null is not supported inside collections");
}
if (value_bytes.is_unset_value()) {
return constants::UNSET_VALUE;
}
if (value_bytes->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Map value is too long. Map values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(),
@@ -252,17 +258,22 @@ maps::delayed_value::bind(const query_options& options) {
::shared_ptr<terminal>
maps::marker::bind(const query_options& options) {
auto val = options.get_value_at(_bind_index);
return val ?
::make_shared<maps::value>(
maps::value::from_serialized(*val,
static_pointer_cast<const map_type_impl>(
_receiver->type),
options.get_cql_serialization_format())) :
nullptr;
if (val.is_null()) {
return nullptr;
}
if (val.is_unset_value()) {
return constants::UNSET_VALUE;
}
return ::make_shared<maps::value>(maps::value::from_serialized(*val, static_pointer_cast<const map_type_impl>(_receiver->type),
options.get_cql_serialization_format()));
}
void
maps::setter::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) {
auto value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
}
if (column.type->is_multi_cell()) {
// delete + put
collection_type_impl::mutation mut;
@@ -271,7 +282,7 @@ maps::setter::execute(mutation& m, const exploded_clustering_prefix& row_key, co
auto col_mut = ctype->serialize_mutation_form(std::move(mut));
m.set_cell(row_key, column, std::move(col_mut));
}
do_put(m, row_key, params, _t, column);
do_put(m, row_key, params, value, column);
}
void
@@ -306,13 +317,15 @@ maps::setter_by_key::execute(mutation& m, const exploded_clustering_prefix& pref
void
maps::putter::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to add items to a frozen map";
do_put(m, prefix, params, _t, column);
auto value = _t->bind(params._options);
if (value != constants::UNSET_VALUE) {
do_put(m, prefix, params, value, column);
}
}
void
maps::do_put(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params,
shared_ptr<term> t, const column_definition& column) {
auto value = t->bind(params._options);
shared_ptr<term> value, const column_definition& column) {
auto map_value = dynamic_pointer_cast<maps::value>(value);
if (column.type->is_multi_cell()) {
collection_type_impl::mutation mut;
@@ -346,6 +359,9 @@ maps::discarder_by_key::execute(mutation& m, const exploded_clustering_prefix& p
if (!key) {
throw exceptions::invalid_request_exception("Invalid null map key");
}
if (key == constants::UNSET_VALUE) {
throw exceptions::invalid_request_exception("Invalid unset map key");
}
collection_type_impl::mutation mut;
mut.cells.emplace_back(*key->get(params._options), params.make_dead_cell());
auto mtype = static_cast<const map_type_impl*>(column.type.get());

@@ -82,7 +82,7 @@ public:
: map(std::move(map)) {
}
static value from_serialized(bytes_view value, map_type type, cql_serialization_format sf);
virtual bytes_opt get(const query_options& options) override;
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf);
bool equals(map_type mt, const value& v);
virtual sstring to_string() const;
@@ -138,7 +138,7 @@ public:
};
static void do_put(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params,
shared_ptr<term> t, const column_definition& column);
shared_ptr<term> value, const column_definition& column);
class discarder_by_key : public operation {
public:

@@ -184,6 +184,13 @@ protected:
throw exceptions::invalid_request_exception(sprint("%s cannot be used for Multi-column relations", get_operator()));
}
virtual ::shared_ptr<relation> maybe_rename_identifier(const column_identifier::raw& from, column_identifier::raw to) override {
auto new_entities = boost::copy_range<decltype(_entities)>(_entities | boost::adaptors::transformed([&] (auto&& entity) {
return *entity == from ? ::make_shared<column_identifier::raw>(to) : entity;
}));
return ::make_shared(multi_column_relation(std::move(new_entities), _relation_type, _values_or_marker, _in_values, _in_marker));
}
virtual shared_ptr<term> to_term(const std::vector<shared_ptr<column_specification>>& receivers,
::shared_ptr<term::raw> raw, database& db, const sstring& keyspace,
::shared_ptr<variable_specifications> bound_names) override {

@@ -88,13 +88,10 @@ operation::addition::prepare(database& db, const sstring& keyspace, const column
auto ctype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!ctype) {
fail(unimplemented::cause::COUNTERS);
// FIXME: implement
#if 0
if (!(receiver.type instanceof CounterColumnType))
throw new InvalidRequestException(String.format("Invalid operation (%s) for non counter column %s", toString(receiver), receiver.name));
return new Constants.Adder(receiver, v);
#endif
if (!receiver.is_counter()) {
throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non counter column %s", receiver, receiver.name()));
}
return make_shared<constants::adder>(receiver, v);
} else if (!ctype->is_multi_cell()) {
throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for frozen collection column %s", receiver, receiver.name()));
}
@@ -119,12 +116,11 @@ shared_ptr<operation>
operation::subtraction::prepare(database& db, const sstring& keyspace, const column_definition& receiver) {
auto ctype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!ctype) {
fail(unimplemented::cause::COUNTERS);
#if 0
if (!(receiver.type instanceof CounterColumnType))
throw new InvalidRequestException(String.format("Invalid operation (%s) for non counter column %s", toString(receiver), receiver.name));
return new Constants.Substracter(receiver, value.prepare(keyspace, receiver));
#endif
if (!receiver.is_counter()) {
throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non counter column %s", receiver, receiver.name()));
}
auto v = _value->prepare(db, keyspace, receiver.column_specification);
return make_shared<constants::subtracter>(receiver, v);
}
if (!ctype->is_multi_cell()) {
throw exceptions::invalid_request_exception(

@@ -95,6 +95,10 @@ public:
return params.make_cell(value);
}
atomic_cell make_counter_update_cell(int64_t delta, const update_parameters& params) const {
return params.make_counter_update_cell(delta);
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const {
return _t && _t->uses_function(ks_name, function_name);
}

@@ -52,5 +52,6 @@ const operator_type operator_type::IN(7, operator_type::IN, "IN");
const operator_type operator_type::CONTAINS(5, operator_type::CONTAINS, "CONTAINS");
const operator_type operator_type::CONTAINS_KEY(6, operator_type::CONTAINS_KEY, "CONTAINS_KEY");
const operator_type operator_type::NEQ(8, operator_type::NEQ, "!=");
const operator_type operator_type::IS_NOT(9, operator_type::IS_NOT, "IS NOT");
}

@@ -42,7 +42,7 @@
#pragma once
#include <cstddef>
#include <iostream>
#include <iosfwd>
#include "core/sstring.hh"
namespace cql3 {
@@ -58,6 +58,7 @@ public:
static const operator_type CONTAINS;
static const operator_type CONTAINS_KEY;
static const operator_type NEQ;
static const operator_type IS_NOT;
private:
int32_t _b;
const operator_type& _reverse;

@@ -47,11 +47,11 @@ namespace cql3 {
thread_local const query_options::specific_options query_options::specific_options::DEFAULT{-1, {}, {}, api::missing_timestamp};
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, std::experimental::nullopt,
std::vector<bytes_view_opt>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
std::vector<cql3::raw_value_view>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
query_options::query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<bytes_opt> values,
std::vector<cql3::raw_value> values,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
@@ -68,7 +68,7 @@ query_options::query_options(db::consistency_level consistency,
query_options::query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<bytes_view_opt> value_views,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
@@ -82,7 +82,7 @@ query_options::query_options(db::consistency_level consistency,
{
}
query_options::query_options(query_options&& o, std::vector<std::vector<bytes_view_opt>> value_views)
query_options::query_options(query_options&& o, std::vector<std::vector<cql3::raw_value_view>> value_views)
: query_options(std::move(o))
{
std::vector<query_options> tmp;
@@ -93,7 +93,7 @@ query_options::query_options(query_options&& o, std::vector<std::vector<bytes_vi
_batch_options = std::move(tmp);
}
query_options::query_options(db::consistency_level cl, std::vector<bytes_opt> values)
query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values)
: query_options(
cl,
{},
@@ -105,7 +105,7 @@ query_options::query_options(db::consistency_level cl, std::vector<bytes_opt> va
{
}
query_options::query_options(std::vector<bytes_opt> values)
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, std::move(values))
{}
@@ -115,7 +115,7 @@ db::consistency_level query_options::get_consistency() const
return _consistency;
}
bytes_view_opt query_options::get_value_at(size_t idx) const
cql3::raw_value_view query_options::get_value_at(size_t idx) const
{
return _value_views.at(idx);
}
@@ -125,14 +125,14 @@ size_t query_options::get_values_count() const
return _value_views.size();
}
bytes_view_opt query_options::make_temporary(bytes_opt value) const
cql3::raw_value_view query_options::make_temporary(cql3::raw_value value) const
{
if (value) {
_temporaries.emplace_back(value->begin(), value->end());
auto& temporary = _temporaries.back();
return bytes_view{temporary.data(), temporary.size()};
return cql3::raw_value_view::make_value(bytes_view{temporary.data(), temporary.size()});
}
return std::experimental::nullopt;
return cql3::raw_value_view::make_null();
}
bool query_options::skip_metadata() const
@@ -192,7 +192,7 @@ void query_options::prepare(const std::vector<::shared_ptr<column_specification>
}
auto& names = *_names;
std::vector<bytes_opt> ordered_values;
std::vector<cql3::raw_value> ordered_values;
ordered_values.reserve(specs.size());
for (auto&& spec : specs) {
auto& spec_name = spec->name->text();
@@ -211,9 +211,9 @@ void query_options::fill_value_views()
{
for (auto&& value : _values) {
if (value) {
_value_views.emplace_back(bytes_view{*value});
_value_views.emplace_back(cql3::raw_value_view::make_value(bytes_view{*value}));
} else {
_value_views.emplace_back(std::experimental::nullopt);
_value_views.emplace_back(cql3::raw_value_view::make_null());
}
}
}

@@ -48,6 +48,7 @@
#include "service/pager/paging_state.hh"
#include "cql3/column_specification.hh"
#include "cql3/column_identifier.hh"
#include "cql3/values.hh"
#include "cql_serialization_format.hh"
namespace cql3 {
@@ -69,8 +70,8 @@ public:
private:
const db::consistency_level _consistency;
const std::experimental::optional<std::vector<sstring_view>> _names;
std::vector<bytes_opt> _values;
std::vector<bytes_view_opt> _value_views;
std::vector<cql3::raw_value> _values;
std::vector<cql3::raw_value_view> _value_views;
mutable std::vector<std::vector<int8_t>> _temporaries;
const bool _skip_metadata;
const specific_options _options;
@@ -82,30 +83,30 @@ public:
explicit query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<bytes_opt> values,
std::vector<cql3::raw_value> values,
bool skip_metadata,
specific_options options,
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<bytes_view_opt> value_views,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
specific_options options,
cql_serialization_format sf);
// Batch query_options constructor
explicit query_options(query_options&&, std::vector<std::vector<bytes_view_opt>> value_views);
explicit query_options(query_options&&, std::vector<std::vector<cql3::raw_value_view>> value_views);
// It can't be const because of prepare()
static thread_local query_options DEFAULT;
// forInternalUse
explicit query_options(std::vector<bytes_opt> values);
explicit query_options(db::consistency_level, std::vector<bytes_opt> values);
explicit query_options(std::vector<cql3::raw_value> values);
explicit query_options(db::consistency_level, std::vector<cql3::raw_value> values);
db::consistency_level get_consistency() const;
bytes_view_opt get_value_at(size_t idx) const;
bytes_view_opt make_temporary(bytes_opt value) const;
cql3::raw_value_view get_value_at(size_t idx) const;
cql3::raw_value_view make_temporary(cql3::raw_value value) const;
size_t get_values_count() const;
bool skip_metadata() const;
/** The pageSize for this query. Will be <= 0 if not relevant for the query. */

@@ -38,11 +38,13 @@
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <seastar/core/metrics.hh>
#include "cql3/query_processor.hh"
#include "cql3/CqlParser.hpp"
#include "cql3/error_collector.hh"
#include "cql3/statements/batch_statement.hh"
#include "cql3/util.hh"
#include "transport/messages/result_message.hh"
@@ -58,7 +60,7 @@ logging::logger log("query_processor");
distributed<query_processor> _the_query_processor;
const sstring query_processor::CQL_VERSION = "3.2.1";
const sstring query_processor::CQL_VERSION = "3.3.1";
class query_processor::internal_state {
service::query_state _qs;
@@ -94,11 +96,42 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy,
, _db(db)
, _internal_state(new internal_state())
{
_collectd_regs.push_back(
scollectd::add_polled_metric(scollectd::type_instance_id("query_processor"
, scollectd::per_cpu_plugin_instance
, "total_operations", "statements_prepared")
, scollectd::make_typed(scollectd::data_type::DERIVE, _stats.prepare_invocations)));
namespace sm = seastar::metrics;
_metrics.add_group("query_processor", {
sm::make_derive("statements_prepared", _stats.prepare_invocations,
sm::description("Counts a total number of parsed CQL requests.")),
});
_metrics.add_group("cql", {
sm::make_derive("reads", _cql_stats.reads,
sm::description("Counts a total number of CQL read requests.")),
sm::make_derive("inserts", _cql_stats.inserts,
sm::description("Counts a total number of CQL INSERT requests.")),
sm::make_derive("updates", _cql_stats.updates,
sm::description("Counts a total number of CQL UPDATE requests.")),
sm::make_derive("deletes", _cql_stats.deletes,
sm::description("Counts a total number of CQL DELETE requests.")),
sm::make_derive("batches", _cql_stats.batches,
sm::description("Counts a total number of CQL BATCH requests.")),
sm::make_derive("statements_in_batches", _cql_stats.statements_in_batches,
sm::description("Counts a total number of sub-statements in CQL BATCH requests.")),
sm::make_derive("batches_pure_logged", _cql_stats.batches_pure_logged,
sm::description("Counts a total number of LOGGED batches that were executed as LOGGED batches.")),
sm::make_derive("batches_pure_unlogged", _cql_stats.batches_pure_unlogged,
sm::description("Counts a total number of UNLOGGED batches that were executed as UNLOGGED batches.")),
sm::make_derive("batches_unlogged_from_logged", _cql_stats.batches_unlogged_from_logged,
sm::description("Counts a total number of LOGGED batches that were executed as UNLOGGED batches.")),
});
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
}
@@ -285,31 +318,18 @@ query_processor::get_statement(const sstring_view& query, const service::client_
Tracing.trace("Preparing statement");
#endif
++_stats.prepare_invocations;
return statement->prepare(_db.local());
return statement->prepare(_db.local(), _cql_stats);
}
::shared_ptr<raw::parsed_statement>
query_processor::parse_statement(const sstring_view& query)
{
try {
cql3_parser::CqlLexer::collector_type lexer_error_collector(query);
cql3_parser::CqlParser::collector_type parser_error_collector(query);
cql3_parser::CqlLexer::InputStreamType input{reinterpret_cast<const ANTLR_UINT8*>(query.begin()), ANTLR_ENC_UTF8, static_cast<ANTLR_UINT32>(query.size()), nullptr};
cql3_parser::CqlLexer lexer{&input};
lexer.set_error_listener(lexer_error_collector);
cql3_parser::CqlParser::TokenStreamType tstream(ANTLR_SIZE_HINT, lexer.get_tokSource());
cql3_parser::CqlParser parser{&tstream};
parser.set_error_listener(parser_error_collector);
auto statement = parser.query();
lexer_error_collector.throw_first_syntax_error();
parser_error_collector.throw_first_syntax_error();
auto statement = util::do_with_parser(query, std::mem_fn(&cql3_parser::CqlParser::query));
if (!statement) {
throw exceptions::syntax_exception("Parsing failed");
}
return std::move(statement);
return statement;
} catch (const exceptions::recognition_exception& e) {
throw exceptions::syntax_exception(sprint("Invalid or malformed CQL query string: %s", e.what()));
} catch (const exceptions::cassandra_exception& e) {
@@ -328,15 +348,15 @@ query_options query_processor::make_internal_options(::shared_ptr<statements::pr
throw std::invalid_argument(sprint("Invalid number of values. Expecting %d but got %d", p->bound_names.size(), values.size()));
}
auto ni = p->bound_names.begin();
std::vector<bytes_opt> bound_values;
std::vector<cql3::raw_value> bound_values;
for (auto& v : values) {
auto& n = *ni++;
if (v.type() == bytes_type) {
bound_values.push_back({value_cast<bytes>(v)});
bound_values.push_back(cql3::raw_value::make_value(value_cast<bytes>(v)));
} else if (v.is_null()) {
bound_values.push_back({});
bound_values.push_back(cql3::raw_value::make_null());
} else {
bound_values.push_back({n->type->decompose(v)});
bound_values.push_back(cql3::raw_value::make_value(n->type->decompose(v)));
}
}
return query_options(cl, bound_values);
@@ -346,7 +366,7 @@ query_options query_processor::make_internal_options(::shared_ptr<statements::pr
{
auto& p = _internal_statements[query_string];
if (p == nullptr) {
auto np = parse_statement(query_string)->prepare(_db.local());
auto np = parse_statement(query_string)->prepare(_db.local(), _cql_stats);
np->statement->validate(_proxy, *_internal_state);
p = std::move(np); // inserts it into map
}
@@ -382,7 +402,7 @@ query_processor::process(const sstring& query_string,
const std::initializer_list<data_value>& values,
bool cache)
{
auto p = cache ? prepare_internal(query_string) : parse_statement(query_string)->prepare(_db.local());
auto p = cache ? prepare_internal(query_string) : parse_statement(query_string)->prepare(_db.local(), _cql_stats);
if (!cache) {
p->statement->validate(_proxy, *_internal_state);
}
@@ -441,6 +461,10 @@ void query_processor::migration_subscriber::on_create_aggregate(const sstring& k
log.warn("{} event ignored", __func__);
}
void query_processor::migration_subscriber::on_create_view(const sstring& ks_name, const sstring& view_name)
{
}
void query_processor::migration_subscriber::on_update_keyspace(const sstring& ks_name)
{
}
@@ -464,6 +488,10 @@ void query_processor::migration_subscriber::on_update_aggregate(const sstring& k
{
}
void query_processor::migration_subscriber::on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed)
{
}
void query_processor::migration_subscriber::on_drop_keyspace(const sstring& ks_name)
{
remove_invalid_prepared_statements(ks_name, std::experimental::nullopt);
@@ -488,6 +516,10 @@ void query_processor::migration_subscriber::on_drop_aggregate(const sstring& ks_
log.warn("{} event ignored", __func__);
}
void query_processor::migration_subscriber::on_drop_view(const sstring& ks_name, const sstring& view_name)
{
}
void query_processor::migration_subscriber::remove_invalid_prepared_statements(sstring ks_name, std::experimental::optional<sstring> cf_name)
{
_qp->invalidate_prepared_statements([&] (::shared_ptr<cql_statement> stmt) {

@@ -43,6 +43,7 @@
#include <experimental/string_view>
#include <unordered_map>
#include <seastar/core/metrics_registration.hh>
#include "core/shared_ptr.hh"
#include "exceptions/exceptions.hh"
@@ -75,7 +76,9 @@ private:
uint64_t prepare_invocations = 0;
} _stats;
std::vector<scollectd::registration> _collectd_regs;
cql_stats _cql_stats;
seastar::metrics::metric_groups _metrics;
class internal_state;
std::unique_ptr<internal_state> _internal_state;
@@ -92,6 +95,11 @@ public:
distributed<service::storage_proxy>& proxy() {
return _proxy;
}
cql_stats& get_cql_stats() {
return _cql_stats;
}
#if 0
public static final QueryProcessor instance = new QueryProcessor();
#endif
@@ -518,18 +526,21 @@ public:
virtual void on_create_user_type(const sstring& ks_name, const sstring& type_name) override;
virtual void on_create_function(const sstring& ks_name, const sstring& function_name) override;
virtual void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override;
virtual void on_create_view(const sstring& ks_name, const sstring& view_name) override;
virtual void on_update_keyspace(const sstring& ks_name) override;
virtual void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool columns_changed) override;
virtual void on_update_user_type(const sstring& ks_name, const sstring& type_name) override;
virtual void on_update_function(const sstring& ks_name, const sstring& function_name) override;
virtual void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override;
virtual void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override;
virtual void on_drop_keyspace(const sstring& ks_name) override;
virtual void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override;
virtual void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override;
virtual void on_drop_function(const sstring& ks_name, const sstring& function_name) override;
virtual void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override;
virtual void on_drop_view(const sstring& ks_name, const sstring& view_name) override;
private:
void remove_invalid_prepared_statements(sstring ks_name, std::experimental::optional<sstring> cf_name);
bool should_invalidate(sstring ks_name, std::experimental::optional<sstring> cf_name, ::shared_ptr<cql_statement> statement);

@@ -156,6 +156,10 @@ public:
return new_contains_restriction(db, schema, bound_names, false);
} else if (_relation_type == operator_type::CONTAINS_KEY) {
return new_contains_restriction(db, schema, bound_names, true);
} else if (_relation_type == operator_type::IS_NOT) {
// This case is not supposed to happen: statement_restrictions
// constructor does not call this function for views' IS_NOT.
throw exceptions::invalid_request_exception(sprint("Unsupported \"IS NOT\" relation: %s", to_string()));
} else {
throw exceptions::invalid_request_exception(sprint("Unsupported \"!=\" relation: %s", to_string()));
}
@@ -216,6 +220,15 @@ public:
virtual ::shared_ptr<restrictions::restriction> new_contains_restriction(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, bool isKey) = 0;
/**
* Renames an identifier in this Relation, if applicable.
* @param from the old identifier
* @param to the new identifier
* @return a pointer to this object, if the old identifier is not in the set of entities that this relation covers;
* otherwise a new Relation with "from" replaced by "to" is returned.
*/
virtual ::shared_ptr<relation> maybe_rename_identifier(const column_identifier::raw& from, column_identifier::raw to) = 0;
protected:
/**

@@ -67,7 +67,7 @@ template<typename ValueType>
struct range_type_for;
template<>
struct range_type_for<partition_key> : public std::remove_reference<query::partition_range> {};
struct range_type_for<partition_key> : public std::remove_reference<dht::partition_range> {};
template<>
struct range_type_for<clustering_key_prefix> : public std::remove_reference<query::clustering_range> {};

@@ -115,7 +115,7 @@ public:
if (restriction->is_slice()) {
throw exceptions::invalid_request_exception(sprint(
"PRIMARY KEY column \"%s\" cannot be restricted (preceding column \"%s\" is restricted by a non-EQ relation)",
_restrictions->next_column(new_column)->name_as_text(), new_column.name_as_text()));
last_column.name_as_text(), new_column.name_as_text()));
}
}
@@ -334,9 +334,9 @@ public:
};
template<>
std::vector<query::partition_range>
dht::partition_range_vector
single_column_primary_key_restrictions<partition_key>::bounds_ranges(const query_options& options) const {
std::vector<query::partition_range> ranges;
dht::partition_range_vector ranges;
ranges.reserve(size());
for (query::range<partition_key>& r : compute_bounds(options)) {
if (!r.is_singular()) {

@@ -28,6 +28,9 @@
#include "single_column_primary_key_restrictions.hh"
#include "token_restriction.hh"
#include "cql3/single_column_relation.hh"
#include "cql3/constants.hh"
namespace cql3 {
namespace restrictions {
@@ -131,13 +134,21 @@ statement_restrictions::statement_restrictions(schema_ptr schema)
, _clustering_columns_restrictions(get_initial_key_restrictions<clustering_key_prefix>())
, _nonprimary_key_restrictions(::make_shared<single_column_restrictions>(schema))
{ }
#if 0
static const column_definition*
to_column_definition(const schema_ptr& schema, const ::shared_ptr<column_identifier::raw>& entity) {
return get_column_definition(schema,
*entity->prepare_column_identifier(schema));
}
#endif
statement_restrictions::statement_restrictions(database& db,
schema_ptr schema,
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection)
bool select_a_collection,
bool for_view)
: statement_restrictions(schema)
{
/*
@@ -149,7 +160,31 @@ statement_restrictions::statement_restrictions(database& db,
*/
if (!where_clause.empty()) {
for (auto&& relation : where_clause) {
add_restriction(relation->to_restriction(db, schema, bound_names));
if (relation->get_operator() == cql3::operator_type::IS_NOT) {
single_column_relation* r =
dynamic_cast<single_column_relation*>(relation.get());
// The "IS NOT NULL" restriction is only supported (and
// mandatory) for materialized view creation:
if (!r) {
throw exceptions::invalid_request_exception("IS NOT only supports single column");
}
// Currently, the grammar only allows NULL as the argument of
// "IS NOT", so this assertion should not be able to fail.
assert(r->get_value() == cql3::constants::NULL_LITERAL);
auto col_id = r->get_entity()->prepare_column_identifier(schema);
const auto *cd = get_column_definition(schema, *col_id);
if (!cd) {
throw exceptions::invalid_request_exception(sprint("restriction '%s' unknown column %s", relation->to_string(), r->get_entity()->to_string()));
}
_not_null_columns.insert(cd);
if (!for_view) {
throw exceptions::invalid_request_exception(sprint("restriction '%s' is only supported in materialized view creation", relation->to_string()));
}
} else {
add_restriction(relation->to_restriction(db, schema, bound_names));
}
}
}
@@ -317,9 +352,9 @@ void statement_restrictions::process_clustering_columns_restrictions(bool has_qu
}
}
std::vector<query::partition_range> statement_restrictions::get_partition_key_ranges(const query_options& options) const {
dht::partition_range_vector statement_restrictions::get_partition_key_ranges(const query_options& options) const {
if (_partition_key_restrictions->empty()) {
return {query::partition_range::make_open_ended_both_sides()};
return {dht::partition_range::make_open_ended_both_sides()};
}
return _partition_key_restrictions->bounds_ranges(options);
}

@@ -83,6 +83,8 @@ private:
*/
::shared_ptr<single_column_restrictions> _nonprimary_key_restrictions;
std::unordered_set<const column_definition*> _not_null_columns;
/**
* The restrictions used to build the index expressions
*/
@@ -112,7 +114,8 @@ public:
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection);
bool select_a_collection,
bool for_view = false);
private:
void add_restriction(::shared_ptr<restriction> restriction);
void add_single_column_restriction(::shared_ptr<single_column_restriction> restriction);
@@ -171,6 +174,20 @@ private:
*/
void process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection);
/**
* Returns the <code>Restrictions</code> for the specified type of columns.
*
* @param kind the column type
* @return the <code>restrictions</code> for the specified type of columns
*/
::shared_ptr<restrictions> get_restrictions(column_kind kind) const {
switch (kind) {
case column_kind::partition_key: return _partition_key_restrictions;
case column_kind::clustering_key: return _clustering_columns_restrictions;
default: return _nonprimary_key_restrictions;
}
}
#if 0
std::vector<::shared_ptr<index_expression>> get_index_expressions(const query_options& options) {
if (!_uses_secondary_indexing || _index_restrictions.empty()) {
@@ -208,7 +225,7 @@ public:
* @return the specified bound of the partition key
* @throws InvalidRequestException if the boundary cannot be retrieved
*/
std::vector<query::partition_range> get_partition_key_ranges(const query_options& options) const;
dht::partition_range_vector get_partition_key_ranges(const query_options& options) const;
#if 0
/**
@@ -346,9 +363,21 @@ public:
* @return <code>true</code> if the query has some restrictions on the clustering columns,
* <code>false</code> otherwise.
*/
bool has_clustering_columns_restriction() {
bool has_clustering_columns_restriction() const {
return !_clustering_columns_restrictions->empty();
}
/**
* @return true if column is restricted by some restriction, false otherwise
*/
bool is_restricted(const column_definition* cdef) const {
if (_not_null_columns.find(cdef) != _not_null_columns.end()) {
return true;
}
auto&& restricted = get_restrictions(cdef->kind).get()->get_column_defs();
return std::find(restricted.begin(), restricted.end(), cdef) != restricted.end();
}
};
}

@@ -97,7 +97,14 @@ public:
if (!buf) {
throw exceptions::invalid_request_exception("Invalid null token value");
}
return dht::token(dht::token::kind::key, *buf);
auto tk = dht::global_partitioner().from_bytes(*buf);
if (tk.is_minimum() && !is_start(b)) {
// The token was parsed as a minimum marker (token::kind::before_all_keys), but
// as it appears in the end bound position, it is actually the maximum marker
// (token::kind::after_all_keys).
return dht::maximum_token();
}
return tk;
};
const auto start_token = get_token_bound(statements::bound::START);

@@ -253,7 +253,7 @@ selection::collect_metadata(schema_ptr schema, const std::vector<::shared_ptr<ra
return r;
}
result_set_builder::result_set_builder(const selection& s, db_clock::time_point now, cql_serialization_format sf)
result_set_builder::result_set_builder(const selection& s, gc_clock::time_point now, cql_serialization_format sf)
: _result_set(std::make_unique<result_set>(::make_shared<metadata>(*(s.get_result_metadata()))))
, _selectors(s.new_selectors())
, _now(now)
@@ -290,7 +290,7 @@ void result_set_builder::add(const column_definition& def, const query::result_a
gc_clock::duration ttl_left(-1);
expiry_opt e = c.expiry();
if (e) {
ttl_left = *e - to_gc_clock(_now);
ttl_left = *e - _now;
}
_ttls[current->size() - 1] = ttl_left.count();
}
@@ -428,12 +428,6 @@ int32_t result_set_builder::ttl_of(size_t idx) {
}
bytes_opt result_set_builder::get_value(data_type t, query::result_atomic_cell_view c) {
if (t->is_counter()) {
fail(unimplemented::cause::COUNTERS);
#if 0
ByteBufferUtil.bytes(CounterContext.instance().total(c.value()))
#endif
}
return {to_bytes(c.value())};
}

@@ -235,10 +235,10 @@ public:
private:
std::vector<api::timestamp_type> _timestamps;
std::vector<int32_t> _ttls;
const db_clock::time_point _now;
const gc_clock::time_point _now;
cql_serialization_format _cql_serialization_format;
public:
result_set_builder(const selection& s, db_clock::time_point now, cql_serialization_format sf);
result_set_builder(const selection& s, gc_clock::time_point now, cql_serialization_format sf);
void add_empty();
void add(bytes_opt value);
void add(const column_definition& def, const query::result_atomic_cell_view& c);

@@ -136,9 +136,9 @@ sets::value::from_serialized(bytes_view v, set_type type, cql_serialization_form
}
}
bytes_opt
cql3::raw_value
sets::value::get(const query_options& options) {
return get_with_protocol_version(options.get_cql_serialization_format());
return cql3::raw_value::make_value(get_with_protocol_version(options.get_cql_serialization_format()));
}
bytes
@@ -191,10 +191,12 @@ sets::delayed_value::bind(const query_options& options) {
for (auto&& t : _elements) {
auto b = t->bind_and_get(options);
if (!b) {
if (b.is_null()) {
throw exceptions::invalid_request_exception("null is not supported inside collections");
}
if (b.is_unset_value()) {
return constants::UNSET_VALUE;
}
// We don't support values > 64K because the serialization format encodes the length as an unsigned short.
if (b->size() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Set value is too long. Set values are limited to %d bytes but %d bytes value provided",
@@ -211,8 +213,10 @@ sets::delayed_value::bind(const query_options& options) {
::shared_ptr<terminal>
sets::marker::bind(const query_options& options) {
const auto& value = options.get_value_at(_bind_index);
if (!value) {
if (value.is_null()) {
return nullptr;
} else if (value.is_unset_value()) {
return constants::UNSET_VALUE;
} else {
auto as_set_type = static_pointer_cast<const set_type_impl>(_receiver->type);
return make_shared(value::from_serialized(*value, as_set_type, options.get_cql_serialization_format()));
@@ -221,6 +225,10 @@ sets::marker::bind(const query_options& options) {
void
sets::setter::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
}
if (column.type->is_multi_cell()) {
// delete + add
collection_type_impl::mutation mut;
@@ -229,19 +237,22 @@ sets::setter::execute(mutation& m, const exploded_clustering_prefix& row_key, co
auto col_mut = ctype->serialize_mutation_form(std::move(mut));
m.set_cell(row_key, column, std::move(col_mut));
}
adder::do_add(m, row_key, params, _t, column);
adder::do_add(m, row_key, params, value, column);
}
void
sets::adder::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
}
assert(column.type->is_multi_cell()); // "Attempted to add items to a frozen set";
do_add(m, row_key, params, _t, column);
do_add(m, row_key, params, value, column);
}
void
sets::adder::do_add(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params,
shared_ptr<term> t, const column_definition& column) {
auto&& value = t->bind(params._options);
shared_ptr<term> value, const column_definition& column) {
auto set_value = dynamic_pointer_cast<sets::value>(std::move(value));
auto set_type = dynamic_pointer_cast<const set_type_impl>(column.type);
if (column.type->is_multi_cell()) {

@@ -79,7 +79,7 @@ public:
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, set_type type, cql_serialization_format sf);
virtual bytes_opt get(const query_options& options) override;
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(set_type st, const value& v);
virtual sstring to_string() const override;
@@ -122,7 +122,7 @@ public:
}
virtual void execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) override;
static void do_add(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params,
shared_ptr<term> t, const column_definition& column);
shared_ptr<term> value, const column_definition& column);
};
// Note that this is reused for Map subtraction too (we subtract a set from a map)

@@ -110,6 +110,11 @@ public:
::shared_ptr<term::raw> get_map_key() {
return _map_key;
}
::shared_ptr<term::raw> get_value() {
return _value;
}
protected:
virtual ::shared_ptr<term> to_term(const std::vector<::shared_ptr<column_specification>>& receivers,
::shared_ptr<term::raw> raw, database& db, const sstring& keyspace,
@@ -164,6 +169,13 @@ protected:
return ::make_shared<restrictions::single_column_restriction::contains>(column_def, std::move(term), is_key);
}
virtual ::shared_ptr<relation> maybe_rename_identifier(const column_identifier::raw& from, column_identifier::raw to) override {
return *_entity == from
? ::make_shared(single_column_relation(
::make_shared<column_identifier::raw>(std::move(to)), _map_key, _relation_type, _value, _in_values))
: static_pointer_cast<single_column_relation>(shared_from_this());
}
private:
/**
* Returns the receivers for this relation.

@@ -103,7 +103,7 @@ shared_ptr<transport::event::schema_change> cql3::statements::alter_keyspace_sta
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_keyspace_statement::prepare(database& db) {
cql3::statements::alter_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_keyspace_statement>(*this));
}


@@ -63,7 +63,7 @@ public:
void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -44,6 +44,9 @@
#include "service/migration_manager.hh"
#include "validation.hh"
#include "db/config.hh"
#include <boost/range/adaptor/filtered.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include "cql3/util.hh"
namespace cql3 {
@@ -75,12 +78,92 @@ void alter_table_statement::validate(distributed<service::storage_proxy>& proxy,
// validated in announce_migration()
}
static const sstring ALTER_TABLE_FEATURE = "ALTER TABLE";
static data_type validate_alter(schema_ptr schema, const column_definition& def, const cql3_type& validator)
{
auto type = def.type->is_reversed() && !validator.get_type()->is_reversed()
? reversed_type_impl::get_instance(validator.get_type())
: validator.get_type();
switch (def.kind) {
case column_kind::partition_key:
if (type->is_counter()) {
throw exceptions::invalid_request_exception(
sprint("counter type is not supported for PRIMARY KEY part %s", def.name_as_text()));
}
if (!type->is_value_compatible_with(*def.type)) {
throw exceptions::configuration_exception(
sprint("Cannot change %s from type %s to type %s: types are incompatible.",
def.name_as_text(),
def.type->as_cql3_type(),
validator));
}
break;
case column_kind::clustering_key:
if (!schema->is_cql3_table()) {
throw exceptions::invalid_request_exception(
sprint("Cannot alter clustering column %s in a non-CQL3 table", def.name_as_text()));
}
// Note that CFMetaData.validateCompatibility already validate the change we're about to do. However, the error message it
// sends is a bit cryptic for a CQL3 user, so validating here for a sake of returning a better error message
// Do note that we need isCompatibleWith here, not just isValueCompatibleWith.
if (!type->is_compatible_with(*def.type)) {
throw exceptions::configuration_exception(
sprint("Cannot change %s from type %s to type %s: types are not order-compatible.",
def.name_as_text(),
def.type->as_cql3_type(),
validator));
}
break;
case column_kind::regular_column:
case column_kind::static_column:
// Thrift allows to change a column validator so CFMetaData.validateCompatibility will let it slide
// if we change to an incompatible type (contrarily to the comparator case). But we don't want to
// allow it for CQL3 (see #5882) so validating it explicitly here. We only care about value compatibility
// though since we won't compare values (except when there is an index, but that is validated by
// ColumnDefinition already).
if (!type->is_value_compatible_with(*def.type)) {
throw exceptions::configuration_exception(
sprint("Cannot change %s from type %s to type %s: types are incompatible.",
def.name_as_text(),
def.type->as_cql3_type(),
validator));
}
break;
}
return type;
}
static void validate_column_rename(const schema& schema, const column_identifier& from, const column_identifier& to)
{
auto def = schema.get_column_definition(from.name());
if (!def) {
throw exceptions::invalid_request_exception(sprint("Cannot rename unknown column %s in table %s", from, schema.cf_name()));
}
if (schema.get_column_definition(to.name())) {
throw exceptions::invalid_request_exception(sprint("Cannot rename column %s to %s in table %s; another column of that name already exist", from, to, schema.cf_name()));
}
if (def->is_part_of_cell_name()) {
throw exceptions::invalid_request_exception(sprint("Cannot rename non PRIMARY KEY part %s", from));
}
if (def->is_indexed()) {
throw exceptions::invalid_request_exception(sprint("Cannot rename column %s because it is secondary indexed", from));
}
}
future<bool> alter_table_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
auto& db = proxy.local().get_db().local();
auto schema = validation::validate_column_family(db, keyspace(), column_family());
if (schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
}
auto cfm = schema_builder(schema);
shared_ptr<cql3_type> validator;
@@ -94,6 +177,9 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
def = get_column_definition(schema, *column_name);
}
auto& cf = db.find_column_family(schema);
std::vector<schema_ptr> view_updates;
switch (_type) {
case alter_table_statement::type::add:
{
@@ -141,6 +227,19 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
}
cfm.with_column(column_name->name(), type, _is_static ? column_kind::static_column : column_kind::regular_column);
// Adding a column to a table which has an include all view requires the column to be added to the view
// as well
if (!_is_static) {
for (auto&& view : cf.views()) {
if (view->view_info()->include_all_columns()) {
schema_builder builder(view);
builder.with_column(column_name->name(), type);
view_updates.push_back(builder.build());
}
}
}
break;
}
case alter_table_statement::type::alter:
@@ -150,57 +249,25 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
throw exceptions::invalid_request_exception(sprint("Column %s was not found in table %s", column_name, column_family()));
}
auto type = validator->get_type();
switch (def->kind) {
case column_kind::partition_key:
if (type->is_counter()) {
throw exceptions::invalid_request_exception(sprint("counter type is not supported for PRIMARY KEY part %s", column_name));
}
if (!type->is_value_compatible_with(*def->type)) {
throw exceptions::configuration_exception(sprint("Cannot change %s from type %s to type %s: types are incompatible.",
column_name,
def->type->as_cql3_type(),
validator));
}
break;
case column_kind::clustering_key:
if (!schema->is_cql3_table()) {
throw exceptions::invalid_request_exception(sprint("Cannot alter clustering column %s in a non-CQL3 table", column_name));
}
// Note that CFMetaData.validateCompatibility already validate the change we're about to do. However, the error message it
// sends is a bit cryptic for a CQL3 user, so validating here for a sake of returning a better error message
// Do note that we need isCompatibleWith here, not just isValueCompatibleWith.
if (!type->is_compatible_with(*def->type)) {
throw exceptions::configuration_exception(sprint("Cannot change %s from type %s to type %s: types are not order-compatible.",
column_name,
def->type->as_cql3_type(),
validator));
}
break;
case column_kind::regular_column:
case column_kind::static_column:
// Thrift allows to change a column validator so CFMetaData.validateCompatibility will let it slide
// if we change to an incompatible type (contrarily to the comparator case). But we don't want to
// allow it for CQL3 (see #5882) so validating it explicitly here. We only care about value compatibility
// though since we won't compare values (except when there is an index, but that is validated by
// ColumnDefinition already).
if (!type->is_value_compatible_with(*def->type)) {
throw exceptions::configuration_exception(sprint("Cannot change %s from type %s to type %s: types are incompatible.",
column_name,
def->type->as_cql3_type(),
validator));
}
break;
}
auto type = validate_alter(schema, *def, *validator);
// In any case, we update the column definition
cfm.with_altered_column_type(column_name->name(), type);
// We also have to validate the view types here. If we have a view which includes a column as part of
// the clustering key, we need to make sure that it is indeed compatible.
for (auto&& view : cf.views()) {
auto* view_def = view->get_column_definition(column_name->name());
if (view_def) {
schema_builder builder(view);
auto view_type = validate_alter(view, *view_def, *validator);
builder.with_altered_column_type(column_name->name(), std::move(view_type));
view_updates.push_back(builder.build());
}
}
break;
}
case alter_table_statement::type::drop:
{
assert(column_name);
if (!schema->is_cql3_table()) {
throw exceptions::invalid_request_exception("Cannot drop columns from a non-CQL3 table");
@@ -219,7 +286,18 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
}
}
}
// If a column is dropped which is included in a view, we don't allow the drop to take place.
auto view_names = ::join(", ", cf.views()
| boost::adaptors::filtered([&] (auto&& v) { return bool(v->get_column_definition(column_name->name())); })
| boost::adaptors::transformed([] (auto&& v) { return v->cf_name(); }));
if (!view_names.empty()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot drop column %s, depended on by materialized views (%s.{%s})",
column_name, keyspace(), view_names));
}
break;
}
case alter_table_statement::type::opts:
if (!_properties) {
@@ -228,6 +306,15 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
_properties->validate();
if (!cf.views().empty() && _properties->get_gc_grace_seconds() == 0) {
throw exceptions::invalid_request_exception(
"Cannot alter gc_grace_seconds of the base table of a "
"materialized view to 0, since this value is used to TTL "
"undelivered updates. Setting gc_grace_seconds too low might "
"cause undelivered updates to expire "
"before being replayed.");
}
if (schema->is_counter() && _properties->get_default_time_to_live() > 0) {
throw exceptions::invalid_request_exception("Cannot set default_time_to_live on a table with counters");
}
@@ -240,29 +327,39 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
auto from = entry.first->prepare_column_identifier(schema);
auto to = entry.second->prepare_column_identifier(schema);
auto def = schema->get_column_definition(from->name());
if (!def) {
throw exceptions::invalid_request_exception(sprint("Cannot rename unknown column %s in table %s", from, column_family()));
}
if (schema->get_column_definition(to->name())) {
throw exceptions::invalid_request_exception(sprint("Cannot rename column %s to %s in table %s; another column of that name already exist", from, to, column_family()));
}
if (def->is_part_of_cell_name()) {
throw exceptions::invalid_request_exception(sprint("Cannot rename non PRIMARY KEY part %s", from));
}
if (def->is_indexed()) {
throw exceptions::invalid_request_exception(sprint("Cannot rename column %s because it is secondary indexed", from));
}
validate_column_rename(*schema, *from, *to);
cfm.with_column_rename(from->name(), to->name());
// If the view includes a renamed column, it must be renamed in the view table and the definition.
for (auto&& view : cf.views()) {
if (view->get_column_definition(from->name())) {
schema_builder builder(view);
auto view_from = entry.first->prepare_column_identifier(view);
auto view_to = entry.second->prepare_column_identifier(view);
validate_column_rename(*view, *view_from, *view_to);
builder.with_column_rename(view_from->name(), view_to->name());
auto new_where = util::rename_column_in_where_clause(
view->view_info()->where_clause(),
column_identifier::raw(view_from->text(), true),
column_identifier::raw(view_to->text(), true));
builder.with_view_info(view->view_info()->base_id(), view->view_info()->base_name(),
view->view_info()->include_all_columns(), std::move(new_where));
view_updates.push_back(builder.build());
}
}
}
break;
}
return service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, is_local_only).then([] {
auto f = service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, is_local_only);
return f.then([is_local_only, view_updates = std::move(view_updates)] {
return parallel_for_each(view_updates, [is_local_only] (auto&& view) {
return service::get_local_migration_manager().announce_view_update(view_ptr(std::move(view)), is_local_only);
});
}).then([] {
return true;
});
}
@@ -274,7 +371,7 @@ shared_ptr<transport::event::schema_change> alter_table_statement::change_event(
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_table_statement::prepare(database& db) {
cql3::statements::alter_table_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_table_statement>(*this));
}


@@ -80,7 +80,7 @@ public:
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -43,6 +43,7 @@
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "boost/range/adaptor/map.hpp"
#include "stdx.hh"
namespace cql3 {
@@ -86,14 +87,14 @@ const sstring& alter_type_statement::keyspace() const
return _name.get_keyspace();
}
static int32_t get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
static stdx::optional<uint32_t> get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
{
for (uint32_t i = 0; i < type->field_names().size(); ++i) {
if (field->name() == type->field_names()[i]) {
return i;
return {i};
}
}
return -1;
return {};
}
void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, bool is_local_only)
@@ -123,7 +124,11 @@ void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, b
}
}
if (modified) {
service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, is_local_only).get();
if (schema->is_view()) {
service::get_local_migration_manager().announce_view_update(view_ptr(cfm.build()), is_local_only).get();
} else {
service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, is_local_only).get();
}
}
}
@@ -164,7 +169,7 @@ alter_type_statement::add_or_alter::add_or_alter(const ut_name& name, bool is_ad
user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_update) const
{
if (get_idx_of_field(to_update, _field_name) >= 0) {
if (get_idx_of_field(to_update, _field_name)) {
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s to type %s: a field of the same name already exists", _field_name->name(), _name.to_string()));
}
@@ -181,19 +186,19 @@ user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_
user_type alter_type_statement::add_or_alter::do_alter(database& db, user_type to_update) const
{
uint32_t idx = get_idx_of_field(to_update, _field_name);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, _field_name);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", _field_name->name(), _name.to_string()));
}
auto previous = to_update->field_types()[idx];
auto previous = to_update->field_types()[*idx];
auto new_type = _field_type->prepare(db, keyspace())->get_type();
if (!new_type->is_compatible_with(*previous)) {
throw exceptions::invalid_request_exception(sprint("Type %s in incompatible with previous type %s of field %s in user type %s", _field_type->to_string(), previous->as_cql3_type()->to_string(), _field_name->name(), _name.to_string()));
}
std::vector<data_type> new_types(to_update->field_types());
new_types[idx] = new_type;
new_types[*idx] = new_type;
return user_type_impl::get_instance(to_update->_keyspace, to_update->_name, to_update->field_names(), std::move(new_types));
}
@@ -217,11 +222,11 @@ user_type alter_type_statement::renames::make_updated_type(database& db, user_ty
std::vector<bytes> new_names(to_update->field_names());
for (auto&& rename : _renames) {
auto&& from = rename.first;
int32_t idx = get_idx_of_field(to_update, from);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, from);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", from->to_string(), _name.to_string()));
}
new_names[idx] = rename.second->name();
new_names[*idx] = rename.second->name();
}
auto&& updated = user_type_impl::get_instance(to_update->_keyspace, to_update->_name, std::move(new_names), to_update->field_types());
create_type_statement::check_for_duplicate_names(updated);
@@ -229,12 +234,12 @@ user_type alter_type_statement::renames::make_updated_type(database& db, user_ty
}
shared_ptr<cql3::statements::prepared_statement>
alter_type_statement::add_or_alter::prepare(database& db) {
alter_type_statement::add_or_alter::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::add_or_alter>(*this));
}
shared_ptr<cql3::statements::prepared_statement>
alter_type_statement::renames::prepare(database& db) {
alter_type_statement::renames::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::renames>(*this));
}
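
Note on the get_idx_of_field() change above: the old do_alter() stored the int32_t result in a uint32_t and then tested `idx < 0`, which can never be true for an unsigned value, so a missing field was not detected there. The sketch below illustrates the difference between the sentinel and the optional-based lookup. It is a simplified stand-in, not the project's code: it assumes C++17 `std::optional` instead of `stdx::optional`, and `find_field_index` and `store_in_unsigned` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for get_idx_of_field(): returns the position of
// `field` in `names`, or an empty optional when the field is absent.
std::optional<uint32_t> find_field_index(const std::vector<std::string>& names,
                                         const std::string& field) {
    for (uint32_t i = 0; i < names.size(); ++i) {
        if (names[i] == field) {
            return i;
        }
    }
    return std::nullopt;
}

// Models the old pattern: a signed sentinel (-1) assigned to an unsigned
// variable. The conversion wraps around, so the value is never negative
// and a later `idx < 0` test cannot fire.
uint32_t store_in_unsigned(int32_t raw_result) {
    return static_cast<uint32_t>(raw_result);
}
```

With the optional, "not found" is a distinct state that the caller has to check before dereferencing, which is what the `if (!idx)` and `*idx` lines in the new code do.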


@@ -84,7 +84,7 @@ public:
const shared_ptr<column_identifier> field_name,
const shared_ptr<cql3_type::raw> field_type);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
private:
user_type do_add(database& db, user_type to_update) const;
user_type do_alter(database& db, user_type to_update) const;
@@ -101,7 +101,7 @@ public:
void add_rename(shared_ptr<column_identifier> previous_name, shared_ptr<column_identifier> new_name);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -0,0 +1,121 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "cql3/statements/alter_view_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "service/migration_manager.hh"
#include "validation.hh"
namespace cql3 {
namespace statements {
alter_view_statement::alter_view_statement(::shared_ptr<cf_name> view_name, ::shared_ptr<cf_prop_defs> properties)
: schema_altering_statement{std::move(view_name)}
, _properties{std::move(properties)}
{
}
future<> alter_view_statement::check_access(const service::client_state& state)
{
try {
auto&& s = service::get_local_storage_proxy().get_db().local().find_schema(keyspace(), column_family());
if (s->is_view()) {
return state.has_column_family_access(keyspace(), s->view_info()->base_name(), auth::permission::ALTER);
}
} catch (const no_such_column_family& e) {
// Will be validated afterwards.
}
return make_ready_future<>();
}
void alter_view_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state)
{
// validated in announce_migration()
}
future<bool> alter_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
auto&& db = proxy.local().get_db().local();
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
if (!schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table");
}
if (!_properties) {
throw exceptions::invalid_request_exception("ALTER MATERIALIZED VIEW WITH invoked, but no parameters found");
}
_properties->validate();
auto builder = schema_builder(schema);
_properties->apply_to_builder(builder);
if (builder.get_gc_grace_seconds() == 0) {
throw exceptions::invalid_request_exception(
"Cannot alter gc_grace_seconds of a materialized view to 0, since this "
"value is used to TTL undelivered updates. Setting gc_grace_seconds too "
"low might cause undelivered updates to expire before being replayed.");
}
return service::get_local_migration_manager().announce_view_update(view_ptr(builder.build()), is_local_only).then([] {
return true;
});
}
shared_ptr<transport::event::schema_change> alter_view_statement::change_event()
{
using namespace transport;
return make_shared<event::schema_change>(event::schema_change::change_type::UPDATED,
event::schema_change::target_type::TABLE,
keyspace(),
column_family());
}
shared_ptr<cql3::statements::prepared_statement>
alter_view_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_view_statement>(*this));
}
}
}


@@ -0,0 +1,74 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/shared_ptr.hh>
#include "database.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/cf_name.hh"
namespace cql3 {
namespace statements {
/** An <code>ALTER MATERIALIZED VIEW</code> parsed from a CQL query statement. */
class alter_view_statement : public schema_altering_statement {
private:
::shared_ptr<cf_prop_defs> _properties;
public:
alter_view_statement(::shared_ptr<cf_name> view_name, ::shared_ptr<cf_prop_defs> properties);
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}
}


@@ -47,7 +47,7 @@ uint32_t cql3::statements::authentication_statement::get_bound_terms() {
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authentication_statement::prepare(
database& db) {
database& db, cql_stats& stats) {
return ::make_shared<prepared>(this->shared_from_this());
}


@@ -54,7 +54,7 @@ class authentication_statement : public raw::parsed_statement, public cql_statem
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;


@@ -47,7 +47,7 @@ uint32_t cql3::statements::authorization_statement::get_bound_terms() {
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authorization_statement::prepare(
database& db) {
database& db, cql_stats& stats) {
return ::make_shared<parsed_statement::prepared>(this->shared_from_this());
}


@@ -58,7 +58,7 @@ class authorization_statement : public raw::parsed_statement, public cql_stateme
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;


@@ -41,12 +41,48 @@
#include "raw/batch_statement.hh"
#include "db/config.hh"
namespace {
struct mutation_equals_by_key {
bool operator()(const mutation& m1, const mutation& m2) const {
return m1.schema() == m2.schema()
&& m1.decorated_key().equal(*m1.schema(), m2.decorated_key());
}
};
struct mutation_hash_by_key {
size_t operator()(const mutation& m) const {
auto dk_hash = std::hash<dht::decorated_key>();
return dk_hash(m.decorated_key());
}
};
}
namespace cql3 {
namespace statements {
logging::logger batch_statement::_logger("BatchStatement");
batch_statement::batch_statement(int bound_terms, type type_,
std::vector<shared_ptr<modification_statement>> statements,
std::unique_ptr<attributes> attrs,
cql_stats& stats)
: _bound_terms(bound_terms), _type(type_), _statements(std::move(statements))
, _attrs(std::move(attrs))
, _has_conditions(boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::has_conditions)))
, _stats(stats)
{
}
bool batch_statement::uses_function(const sstring& ks_name, const sstring& function_name) const
{
return _attrs->uses_function(ks_name, function_name)
|| boost::algorithm::any_of(_statements, [&] (auto&& s) { return s->uses_function(ks_name, function_name); });
}
bool batch_statement::depends_on_keyspace(const sstring& ks_name) const
{
return false;
@@ -57,6 +93,112 @@ bool batch_statement::depends_on_column_family(const sstring& cf_name) const
return false;
}
uint32_t batch_statement::get_bound_terms()
{
return _bound_terms;
}
future<> batch_statement::check_access(const service::client_state& state)
{
return parallel_for_each(_statements.begin(), _statements.end(), [&state](auto&& s) {
return s->check_access(state);
});
}
void batch_statement::validate()
{
if (_attrs->is_time_to_live_set()) {
throw exceptions::invalid_request_exception("Global TTL on the BATCH statement is not supported.");
}
bool timestamp_set = _attrs->is_timestamp_set();
if (timestamp_set) {
if (_has_conditions) {
throw exceptions::invalid_request_exception("Cannot provide custom timestamp for conditional BATCH");
}
if (_type == type::COUNTER) {
throw exceptions::invalid_request_exception("Cannot provide custom timestamp for counter BATCH");
}
}
bool has_counters = boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::is_counter));
bool has_non_counters = !boost::algorithm::all_of(_statements, std::mem_fn(&modification_statement::is_counter));
if (timestamp_set && has_counters) {
throw exceptions::invalid_request_exception("Cannot provide custom timestamp for a BATCH containing counters");
}
if (timestamp_set && boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::is_timestamp_set))) {
throw exceptions::invalid_request_exception("Timestamp must be set either on BATCH or individual statements");
}
if (_type == type::COUNTER && has_non_counters) {
throw exceptions::invalid_request_exception("Cannot include non-counter statement in a counter batch");
}
if (_type == type::LOGGED && has_counters) {
throw exceptions::invalid_request_exception("Cannot include a counter statement in a logged batch");
}
if (has_counters && has_non_counters) {
throw exceptions::invalid_request_exception("Counter and non-counter mutations cannot exist in the same batch");
}
if (_has_conditions
&& !_statements.empty()
&& (boost::distance(_statements
| boost::adaptors::transformed(std::mem_fn(&modification_statement::keyspace))
| boost::adaptors::uniqued) != 1
|| (boost::distance(_statements
| boost::adaptors::transformed(std::mem_fn(&modification_statement::column_family))
| boost::adaptors::uniqued) != 1))) {
throw exceptions::invalid_request_exception("Batch with conditions cannot span multiple tables");
}
}
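
Note on the conditional-batch check in validate() above: `boost::adaptors::uniqued` collapses only adjacent duplicates, so `boost::distance(...)` counts runs of equal names rather than distinct names. For a non-empty sequence the run count is 1 exactly when every element is equal, which is all the check needs. The following is a minimal standard-library sketch of the same idea; `count_adjacent_runs` and `spans_multiple_tables` are hypothetical helpers, and plain strings stand in for the keyspace and table names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Counts the elements left after collapsing adjacent duplicates, which is
// what boost::distance(range | uniqued) computes in the check above.
std::size_t count_adjacent_runs(const std::vector<std::string>& names) {
    std::vector<std::string> collapsed;
    std::unique_copy(names.begin(), names.end(), std::back_inserter(collapsed));
    return collapsed.size();
}

// Mirrors the shape of the condition in validate(): an empty batch is
// accepted, otherwise exactly one run means exactly one table.
bool spans_multiple_tables(const std::vector<std::string>& names) {
    return !names.empty() && count_adjacent_runs(names) != 1;
}
```

A sequence such as a, b, a yields three runs, not two, but any result other than 1 still means that more than one table is involved.
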
void batch_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state)
{
for (auto&& s : _statements) {
s->validate(proxy, state);
}
}
const std::vector<shared_ptr<modification_statement>>& batch_statement::get_statements()
{
return _statements;
}
future<std::vector<mutation>> batch_statement::get_mutations(distributed<service::storage_proxy>& storage, const query_options& options, bool local, api::timestamp_type now, tracing::trace_state_ptr trace_state) {
// Do not process in parallel because operations like list append/prepend depend on execution order.
using mutation_set_type = std::unordered_set<mutation, mutation_hash_by_key, mutation_equals_by_key>;
return do_with(mutation_set_type(), [this, &storage, &options, now, local, trace_state] (auto& result) {
result.reserve(_statements.size());
_stats.statements_in_batches += _statements.size();
return do_for_each(boost::make_counting_iterator<size_t>(0),
boost::make_counting_iterator<size_t>(_statements.size()),
[this, &storage, &options, now, local, &result, trace_state] (size_t i) {
auto&& statement = _statements[i];
statement->inc_cql_stats();
auto&& statement_options = options.for_statement(i);
auto timestamp = _attrs->get_timestamp(now, statement_options);
return statement->get_mutations(storage, statement_options, local, timestamp, trace_state).then([&result] (auto&& more) {
for (auto&& m : more) {
// We want unordered_set::try_emplace(), but we don't have it
auto pos = result.find(m);
if (pos == result.end()) {
result.emplace(std::move(m));
} else {
const_cast<mutation&>(*pos).apply(std::move(m)); // Won't change key
}
}
});
}).then([&result] {
// can't use range adaptors, because we want to move
auto vresult = std::vector<mutation>();
vresult.reserve(result.size());
for (auto&& m : result) {
vresult.push_back(std::move(m));
}
return vresult;
});
});
}
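
Note on the merge loop in get_mutations() above: mutations that target the same schema and partition key are folded into one entry by keeping them in an `unordered_set` whose hash and equality look only at the key. The sketch below shows that find-then-emplace pattern in isolation. It is a simplified illustration under stated assumptions: `update`, `update_hash_by_key`, `update_equals_by_key` and `merge_into` are hypothetical stand-ins for `mutation` and its helpers, and a string key replaces the schema plus decorated key.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a mutation: a key plus a payload.
struct update {
    std::string key;
    std::vector<int> cells;

    // Folds another update for the same key into this one; never changes key.
    void apply(update&& other) {
        cells.insert(cells.end(), other.cells.begin(), other.cells.end());
    }
};

struct update_hash_by_key {
    std::size_t operator()(const update& u) const {
        return std::hash<std::string>()(u.key);
    }
};

struct update_equals_by_key {
    bool operator()(const update& a, const update& b) const {
        return a.key == b.key;
    }
};

using update_set = std::unordered_set<update, update_hash_by_key, update_equals_by_key>;

// Inserts each update, or merges it into the entry that already has its key.
void merge_into(update_set& result, std::vector<update> more) {
    for (auto&& u : more) {
        auto pos = result.find(u);
        if (pos == result.end()) {
            result.emplace(std::move(u));
        } else {
            // Set elements are exposed as const. The cast is acceptable only
            // because apply() leaves the key, and so the hash, untouched.
            const_cast<update&>(*pos).apply(std::move(u));
        }
    }
}
```

The statements are still processed one after another, as the comment in the source requires, so the order in which payloads are applied to a given key follows statement order.
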
void batch_statement::verify_batch_size(const std::vector<mutation>& mutations) {
size_t warn_threshold = service::get_local_storage_proxy().get_db().local().get_config().batch_size_warn_threshold_in_kb() * 1024;
@@ -99,25 +241,174 @@ void batch_statement::verify_batch_size(const std::vector<mutation>& mutations)
}
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute(
distributed<service::storage_proxy>& storage, service::query_state& state, const query_options& options) {
++_stats.batches;
return execute(storage, state, options, false, options.get_timestamp(state));
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute(
distributed<service::storage_proxy>& storage,
service::query_state& query_state, const query_options& options,
bool local, api::timestamp_type now)
{
// FIXME: we don't support nulls here
#if 0
if (options.get_consistency() == null)
throw new InvalidRequestException("Invalid empty consistency level");
if (options.getSerialConsistency() == null)
throw new InvalidRequestException("Invalid empty serial consistency level");
#endif
if (_has_conditions) {
return execute_with_conditions(storage, options, query_state);
}
return get_mutations(storage, options, local, now, query_state.get_trace_state()).then([this, &storage, &options, tr_state = query_state.get_trace_state()] (std::vector<mutation> ms) mutable {
return execute_without_conditions(storage, std::move(ms), options.get_consistency(), std::move(tr_state));
}).then([] {
return make_ready_future<shared_ptr<transport::messages::result_message>>(
make_shared<transport::messages::result_message::void_message>());
});
}
future<> batch_statement::execute_without_conditions(
distributed<service::storage_proxy>& storage,
std::vector<mutation> mutations,
db::consistency_level cl,
tracing::trace_state_ptr tr_state)
{
// FIXME: do we need to do this?
#if 0
// Extract each collection of cfs from its IMutation and then lazily concatenate all of them into a single Iterable.
Iterable<ColumnFamily> cfs = Iterables.concat(Iterables.transform(mutations, new Function<IMutation, Collection<ColumnFamily>>()
{
public Collection<ColumnFamily> apply(IMutation im)
{
return im.getColumnFamilies();
}
}));
#endif
verify_batch_size(mutations);
bool mutate_atomic = true;
if (_type != type::LOGGED) {
_stats.batches_pure_unlogged += 1;
mutate_atomic = false;
} else {
if (mutations.size() > 1) {
_stats.batches_pure_logged += 1;
} else {
_stats.batches_unlogged_from_logged += 1;
mutate_atomic = false;
}
}
return storage.local().mutate_with_triggers(std::move(mutations), cl, mutate_atomic, std::move(tr_state));
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute_with_conditions(
distributed<service::storage_proxy>& storage,
const query_options& options,
service::query_state& state)
{
fail(unimplemented::cause::LWT);
#if 0
auto now = state.get_timestamp();
ByteBuffer key = null;
String ksName = null;
String cfName = null;
CQL3CasRequest casRequest = null;
Set<ColumnDefinition> columnsWithConditions = new LinkedHashSet<>();
for (int i = 0; i < statements.size(); i++)
{
ModificationStatement statement = statements.get(i);
QueryOptions statementOptions = options.forStatement(i);
long timestamp = attrs.getTimestamp(now, statementOptions);
List<ByteBuffer> pks = statement.buildPartitionKeyNames(statementOptions);
if (pks.size() > 1)
throw new IllegalArgumentException("Batch with conditions cannot span multiple partitions (you cannot use IN on the partition key)");
if (key == null)
{
key = pks.get(0);
ksName = statement.cfm.ksName;
cfName = statement.cfm.cfName;
casRequest = new CQL3CasRequest(statement.cfm, key, true);
}
else if (!key.equals(pks.get(0)))
{
throw new InvalidRequestException("Batch with conditions cannot span multiple partitions");
}
Composite clusteringPrefix = statement.createClusteringPrefix(statementOptions);
if (statement.hasConditions())
{
statement.addConditions(clusteringPrefix, casRequest, statementOptions);
// As soon as we have an ifNotExists, we set columnsWithConditions to null so that everything is in the resultSet
if (statement.hasIfNotExistCondition() || statement.hasIfExistCondition())
columnsWithConditions = null;
else if (columnsWithConditions != null)
Iterables.addAll(columnsWithConditions, statement.getColumnsWithConditions());
}
casRequest.addRowUpdate(clusteringPrefix, statement, statementOptions, timestamp);
}
ColumnFamily result = StorageProxy.cas(ksName, cfName, key, casRequest, options.getSerialConsistency(), options.getConsistency(), state.getClientState());
return new ResultMessage.Rows(ModificationStatement.buildCasResultSet(ksName, key, cfName, result, columnsWithConditions, true, options.forStatement(0)));
#endif
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute_internal(
distributed<service::storage_proxy>& proxy,
service::query_state& query_state, const query_options& options)
{
throw std::runtime_error(sprint("%s not implemented", __PRETTY_FUNCTION__));
#if 0
assert !hasConditions;
for (IMutation mutation : getMutations(BatchQueryOptions.withoutPerStatementVariables(options), true, queryState.getTimestamp()))
{
// We don't use counters internally.
assert mutation instanceof Mutation;
((Mutation) mutation).apply();
}
return null;
#endif
}
namespace raw {
shared_ptr<prepared_statement>
batch_statement::prepare(database& db) {
batch_statement::prepare(database& db, cql_stats& stats) {
auto&& bound_names = get_bound_variables();
stdx::optional<sstring> first_ks;
stdx::optional<sstring> first_cf;
bool have_multiple_cfs = false;
std::vector<shared_ptr<cql3::statements::modification_statement>> statements;
for (auto&& parsed : _parsed_statements) {
statements.push_back(parsed->prepare(db, bound_names));
if (!first_ks) {
first_ks = parsed->keyspace();
first_cf = parsed->column_family();
} else {
have_multiple_cfs = first_ks.value() != parsed->keyspace() || first_cf.value() != parsed->column_family();
}
statements.push_back(parsed->prepare(db, bound_names, stats));
}
auto&& prep_attrs = _attrs->prepare(db, "[batch]", "[batch]");
prep_attrs->collect_marker_specification(bound_names);
cql3::statements::batch_statement batch_statement_(bound_names->size(), _type, std::move(statements), std::move(prep_attrs));
cql3::statements::batch_statement batch_statement_(bound_names->size(), _type, std::move(statements), std::move(prep_attrs), stats);
batch_statement_.validate();
std::vector<uint16_t> partition_key_bind_indices;
if (!have_multiple_cfs && batch_statement_.get_statements().size() > 0) {
partition_key_bind_indices = bound_names->get_partition_key_bind_indexes(batch_statement_.get_statements()[0]->s);
}
return ::make_shared<prepared>(make_shared(std::move(batch_statement_)),
bound_names->get_specifications());
bound_names->get_specifications(),
std::move(partition_key_bind_indices));
}
}
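
The rewritten `get_mutations()` above no longer appends every per-statement mutation to a vector; it merges mutations that share a key into a single set element. A minimal standalone sketch of that find-then-emplace pattern follows. `toy_mutation`, `hash_by_key`, `equals_by_key` and `merge_by_key` are hypothetical stand-ins chosen for illustration, not Scylla types, and `apply()` here only concatenates payloads.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mutation: a key plus some payload.
struct toy_mutation {
    std::string key;
    std::vector<int> cells;

    // Merge another mutation for the same key into this one.
    void apply(toy_mutation&& other) {
        cells.insert(cells.end(), other.cells.begin(), other.cells.end());
    }
};

// Hash and equality look at the key only, so two mutations with the
// same key count as the same set element.
struct hash_by_key {
    std::size_t operator()(const toy_mutation& m) const {
        return std::hash<std::string>()(m.key);
    }
};

struct equals_by_key {
    bool operator()(const toy_mutation& a, const toy_mutation& b) const {
        return a.key == b.key;
    }
};

using toy_mutation_set =
    std::unordered_set<toy_mutation, hash_by_key, equals_by_key>;

// Insert each mutation, or merge it into the existing entry for its key.
inline void merge_by_key(toy_mutation_set& result, std::vector<toy_mutation> more) {
    for (auto& m : more) {
        auto pos = result.find(m);
        if (pos == result.end()) {
            result.emplace(std::move(m));
        } else {
            // Set elements are const. The cast is safe only because apply()
            // never touches the key that the hash and equality depend on.
            const_cast<toy_mutation&>(*pos).apply(std::move(m));
        }
    }
}
```

In this sketch the order of elements inside the set is unspecified, but the order in which payloads are merged for one key follows the order of the calls, which is the property the "do not process in parallel" comment in the diff is protecting.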


@@ -68,12 +68,11 @@ public:
using type = raw::batch_statement::type;
private:
int _bound_terms;
public:
type _type;
private:
std::vector<shared_ptr<modification_statement>> _statements;
std::unique_ptr<attributes> _attrs;
bool _has_conditions;
cql_stats& _stats;
public:
/**
* Creates a new BatchStatement from a list of statements and a
@@ -85,106 +84,29 @@ public:
*/
batch_statement(int bound_terms, type type_,
std::vector<shared_ptr<modification_statement>> statements,
std::unique_ptr<attributes> attrs)
: _bound_terms(bound_terms), _type(type_), _statements(std::move(statements))
, _attrs(std::move(attrs))
, _has_conditions(boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::has_conditions))) {
}
std::unique_ptr<attributes> attrs,
cql_stats& stats);
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return _attrs->uses_function(ks_name, function_name)
|| boost::algorithm::any_of(_statements, [&] (auto&& s) { return s->uses_function(ks_name, function_name); });
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual uint32_t get_bound_terms() override {
return _bound_terms;
}
virtual uint32_t get_bound_terms() override;
virtual future<> check_access(const service::client_state& state) override {
return parallel_for_each(_statements.begin(), _statements.end(), [&state](auto&& s) {
return s->check_access(state);
});
}
virtual future<> check_access(const service::client_state& state) override;
// Validates a prepared batch statement without validating its nested statements.
void validate() {
if (_attrs->is_time_to_live_set()) {
throw exceptions::invalid_request_exception("Global TTL on the BATCH statement is not supported.");
}
bool timestamp_set = _attrs->is_timestamp_set();
if (timestamp_set) {
if (_has_conditions) {
throw exceptions::invalid_request_exception("Cannot provide custom timestamp for conditional BATCH");
}
if (_type == type::COUNTER) {
throw exceptions::invalid_request_exception("Cannot provide custom timestamp for counter BATCH");
}
}
bool has_counters = boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::is_counter));
bool has_non_counters = !boost::algorithm::all_of(_statements, std::mem_fn(&modification_statement::is_counter));
if (timestamp_set && has_counters) {
throw exceptions::invalid_request_exception("Cannot provide custom timestamp for a BATCH containing counters");
}
if (timestamp_set && boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::is_timestamp_set))) {
throw exceptions::invalid_request_exception("Timestamp must be set either on BATCH or individual statements");
}
if (_type == type::COUNTER && has_non_counters) {
throw exceptions::invalid_request_exception("Cannot include non-counter statement in a counter batch");
}
if (_type == type::LOGGED && has_counters) {
throw exceptions::invalid_request_exception("Cannot include a counter statement in a logged batch");
}
if (has_counters && has_non_counters) {
throw exceptions::invalid_request_exception("Counter and non-counter mutations cannot exist in the same batch");
}
if (_has_conditions
&& !_statements.empty()
&& (boost::distance(_statements
| boost::adaptors::transformed(std::mem_fn(&modification_statement::keyspace))
| boost::adaptors::uniqued) != 1
|| (boost::distance(_statements
| boost::adaptors::transformed(std::mem_fn(&modification_statement::column_family))
| boost::adaptors::uniqued) != 1))) {
throw exceptions::invalid_request_exception("Batch with conditions cannot span multiple tables");
}
}
void validate();
// The batch itself will be validated in either Parsed#prepare() - for regular CQL3 batches,
// or in QueryProcessor.processBatch() - for native protocol batches.
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override {
for (auto&& s : _statements) {
s->validate(proxy, state);
}
}
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
const std::vector<shared_ptr<modification_statement>>& get_statements() {
return _statements;
}
const std::vector<shared_ptr<modification_statement>>& get_statements();
private:
future<std::vector<mutation>> get_mutations(distributed<service::storage_proxy>& storage, const query_options& options, bool local, api::timestamp_type now, tracing::trace_state_ptr trace_state) {
// Do not process in parallel because operations like list append/prepend depend on execution order.
return do_with(std::vector<mutation>(), [this, &storage, &options, now, local, trace_state] (auto&& result) {
return do_for_each(boost::make_counting_iterator<size_t>(0),
boost::make_counting_iterator<size_t>(_statements.size()),
[this, &storage, &options, now, local, &result, trace_state] (size_t i) {
auto&& statement = _statements[i];
auto&& statement_options = options.for_statement(i);
auto timestamp = _attrs->get_timestamp(now, statement_options);
return statement->get_mutations(storage, statement_options, local, timestamp, trace_state).then([&result] (auto&& more) {
std::move(more.begin(), more.end(), std::back_inserter(result));
});
}).then([&result] {
return std::move(result);
});
});
}
future<std::vector<mutation>> get_mutations(distributed<service::storage_proxy>& storage, const query_options& options, bool local, api::timestamp_type now, tracing::trace_state_ptr trace_state);
public:
/**
@@ -194,123 +116,27 @@ public:
static void verify_batch_size(const std::vector<mutation>& mutations);
virtual future<shared_ptr<transport::messages::result_message>> execute(
distributed<service::storage_proxy>& storage, service::query_state& state, const query_options& options) override {
return execute(storage, state, options, false, options.get_timestamp(state));
}
distributed<service::storage_proxy>& storage, service::query_state& state, const query_options& options) override;
private:
future<shared_ptr<transport::messages::result_message>> execute(
distributed<service::storage_proxy>& storage,
service::query_state& query_state, const query_options& options,
bool local, api::timestamp_type now) {
// FIXME: we don't support nulls here
#if 0
if (options.get_consistency() == null)
throw new InvalidRequestException("Invalid empty consistency level");
if (options.getSerialConsistency() == null)
throw new InvalidRequestException("Invalid empty serial consistency level");
#endif
if (_has_conditions) {
return execute_with_conditions(storage, options, query_state);
}
return get_mutations(storage, options, local, now, query_state.get_trace_state()).then([this, &storage, &options, tr_state = query_state.get_trace_state()] (std::vector<mutation> ms) mutable {
return execute_without_conditions(storage, std::move(ms), options.get_consistency(), std::move(tr_state));
}).then([] {
return make_ready_future<shared_ptr<transport::messages::result_message>>(
make_shared<transport::messages::result_message::void_message>());
});
}
bool local, api::timestamp_type now);
future<> execute_without_conditions(
distributed<service::storage_proxy>& storage,
std::vector<mutation> mutations,
db::consistency_level cl,
tracing::trace_state_ptr tr_state) {
// FIXME: do we need to do this?
#if 0
// Extract each collection of cfs from its IMutation and then lazily concatenate all of them into a single Iterable.
Iterable<ColumnFamily> cfs = Iterables.concat(Iterables.transform(mutations, new Function<IMutation, Collection<ColumnFamily>>()
{
public Collection<ColumnFamily> apply(IMutation im)
{
return im.getColumnFamilies();
}
}));
#endif
verify_batch_size(mutations);
bool mutate_atomic = _type == type::LOGGED && mutations.size() > 1;
return storage.local().mutate_with_triggers(std::move(mutations), cl, mutate_atomic, std::move(tr_state));
}
tracing::trace_state_ptr tr_state);
future<shared_ptr<transport::messages::result_message>> execute_with_conditions(
distributed<service::storage_proxy>& storage,
const query_options& options,
service::query_state& state) {
fail(unimplemented::cause::LWT);
#if 0
auto now = state.get_timestamp();
ByteBuffer key = null;
String ksName = null;
String cfName = null;
CQL3CasRequest casRequest = null;
Set<ColumnDefinition> columnsWithConditions = new LinkedHashSet<>();
for (int i = 0; i < statements.size(); i++)
{
ModificationStatement statement = statements.get(i);
QueryOptions statementOptions = options.forStatement(i);
long timestamp = attrs.getTimestamp(now, statementOptions);
List<ByteBuffer> pks = statement.buildPartitionKeyNames(statementOptions);
if (pks.size() > 1)
throw new IllegalArgumentException("Batch with conditions cannot span multiple partitions (you cannot use IN on the partition key)");
if (key == null)
{
key = pks.get(0);
ksName = statement.cfm.ksName;
cfName = statement.cfm.cfName;
casRequest = new CQL3CasRequest(statement.cfm, key, true);
}
else if (!key.equals(pks.get(0)))
{
throw new InvalidRequestException("Batch with conditions cannot span multiple partitions");
}
Composite clusteringPrefix = statement.createClusteringPrefix(statementOptions);
if (statement.hasConditions())
{
statement.addConditions(clusteringPrefix, casRequest, statementOptions);
// As soon as we have an ifNotExists, we set columnsWithConditions to null so that everything is in the resultSet
if (statement.hasIfNotExistCondition() || statement.hasIfExistCondition())
columnsWithConditions = null;
else if (columnsWithConditions != null)
Iterables.addAll(columnsWithConditions, statement.getColumnsWithConditions());
}
casRequest.addRowUpdate(clusteringPrefix, statement, statementOptions, timestamp);
}
ColumnFamily result = StorageProxy.cas(ksName, cfName, key, casRequest, options.getSerialConsistency(), options.getConsistency(), state.getClientState());
return new ResultMessage.Rows(ModificationStatement.buildCasResultSet(ksName, key, cfName, result, columnsWithConditions, true, options.forStatement(0)));
#endif
}
service::query_state& state);
public:
virtual future<shared_ptr<transport::messages::result_message>> execute_internal(
distributed<service::storage_proxy>& proxy,
service::query_state& query_state, const query_options& options) override {
throw std::runtime_error(sprint("%s not implemented", __PRETTY_FUNCTION__));
#if 0
assert !hasConditions;
for (IMutation mutation : getMutations(BatchQueryOptions.withoutPerStatementVariables(options), true, queryState.getTimestamp()))
{
// We don't use counters internally.
assert mutation instanceof Mutation;
((Mutation) mutation).apply();
}
return null;
#endif
}
service::query_state& query_state, const query_options& options) override;
// FIXME: no cql_statement::to_string() yet
#if 0


@@ -100,12 +100,12 @@ void cf_prop_defs::validate() {
}
auto compression_options = get_compression_options();
if (!compression_options.empty()) {
auto sstable_compression_class = compression_options.find(sstring(compression_parameters::SSTABLE_COMPRESSION));
if (sstable_compression_class == compression_options.end()) {
if (compression_options && !compression_options->empty()) {
auto sstable_compression_class = compression_options->find(sstring(compression_parameters::SSTABLE_COMPRESSION));
if (sstable_compression_class == compression_options->end()) {
throw exceptions::configuration_exception(sstring("Missing sub-option '") + compression_parameters::SSTABLE_COMPRESSION + "' for the '" + KW_COMPRESSION + "' option.");
}
compression_parameters cp(compression_options);
compression_parameters cp(*compression_options);
cp.validate();
}
@@ -131,12 +131,12 @@ std::map<sstring, sstring> cf_prop_defs::get_compaction_options() const {
return std::map<sstring, sstring>{};
}
std::map<sstring, sstring> cf_prop_defs::get_compression_options() const {
stdx::optional<std::map<sstring, sstring>> cf_prop_defs::get_compression_options() const {
auto compression_options = get_map(KW_COMPRESSION);
if (compression_options) {
return compression_options.value();
return { compression_options.value() };
}
return std::map<sstring, sstring>{};
return { };
}
int32_t cf_prop_defs::get_default_time_to_live() const
@@ -144,6 +144,11 @@ int32_t cf_prop_defs::get_default_time_to_live() const
return get_int(KW_DEFAULT_TIME_TO_LIVE, 0);
}
int32_t cf_prop_defs::get_gc_grace_seconds() const
{
return get_int(KW_GCGRACESECONDS, DEFAULT_GC_GRACE_SECONDS);
}
void cf_prop_defs::apply_to_builder(schema_builder& builder) {
if (has_property(KW_COMMENT)) {
builder.set_comment(get_string(KW_COMMENT, ""));
@@ -206,8 +211,9 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder) {
}
builder.set_bloom_filter_fp_chance(get_double(KW_BF_FP_CHANCE, builder.get_bloom_filter_fp_chance()));
if (!get_compression_options().empty()) {
builder.set_compressor_params(compression_parameters(get_compression_options()));
auto compression_options = get_compression_options();
if (compression_options) {
builder.set_compressor_params(compression_parameters(*compression_options));
}
#if 0
CachingOptions cachingOptions = getCachingOptions();
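
The hunks above change `get_compression_options()` from returning a plain map to returning an optional map, which lets callers tell "property not given" apart from "property given but empty". The sketch below illustrates that distinction in isolation. It assumes C++17 `std::optional` rather than the `stdx::optional` alias used in the diff, and `toy_props`, `compression_state` and `classify` are hypothetical names, not part of Scylla.

```cpp
#include <map>
#include <optional>
#include <string>

using option_map = std::map<std::string, std::string>;

// Hypothetical property bag: a property name maps to its sub-options.
struct toy_props {
    std::map<std::string, option_map> maps;

    // Disengaged when the property was never set; engaged (possibly
    // holding an empty map) when it was set.
    std::optional<option_map> get_compression_options() const {
        auto it = maps.find("compression");
        if (it == maps.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};

enum class compression_state { unset, set_empty, set_nonempty };

// With a plain map return type, unset and set_empty would be
// indistinguishable; the optional keeps them apart.
inline compression_state classify(const toy_props& p) {
    auto opts = p.get_compression_options();
    if (!opts) {
        return compression_state::unset;
    }
    return opts->empty() ? compression_state::set_empty
                         : compression_state::set_nonempty;
}
```

In the diff, `validate()` acts only on the non-empty case, while `apply_to_builder()` acts whenever the optional is engaged, so the two call sites map onto different branches of a classification like this one.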


@@ -82,7 +82,7 @@ private:
public:
void validate();
std::map<sstring, sstring> get_compaction_options() const;
std::map<sstring, sstring> get_compression_options() const;
stdx::optional<std::map<sstring, sstring>> get_compression_options() const;
#if 0
public CachingOptions getCachingOptions() throws SyntaxException, ConfigurationException
{
@@ -101,6 +101,7 @@ public:
}
#endif
int32_t get_default_time_to_live() const;
int32_t get_gc_grace_seconds() const;
void apply_to_builder(schema_builder& builder);
void validate_minimum_int(const sstring& field, int32_t minimum_value, int32_t default_value) const;


@@ -0,0 +1,103 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/statements/cf_prop_defs.hh"
namespace cql3 {
namespace statements {
/**
* Class for common statement properties.
*/
class cf_properties final {
const ::shared_ptr<cf_prop_defs> _properties = ::make_shared<cf_prop_defs>();
bool _use_compact_storage = false;
std::vector<std::pair<::shared_ptr<column_identifier>, bool>> _defined_ordering; // Insertion ordering is important
public:
auto& properties() const {
return _properties;
}
bool use_compact_storage() const {
return _use_compact_storage;
}
void set_compact_storage() {
_use_compact_storage = true;
}
auto& defined_ordering() const {
return _defined_ordering;
}
data_type get_reversable_type(::shared_ptr<column_identifier> t, data_type type) const {
auto is_reversed = find_ordering_info(t).value_or(false);
if (!is_reversed && type->is_reversed()) {
return static_pointer_cast<const reversed_type_impl>(type)->underlying_type();
}
if (is_reversed && !type->is_reversed()) {
return reversed_type_impl::get_instance(type);
}
return type;
}
std::experimental::optional<bool> find_ordering_info(::shared_ptr<column_identifier> type) const {
for (auto& t: _defined_ordering) {
if (*(t.first) == *type) {
return t.second;
}
}
return {};
}
void set_ordering(::shared_ptr<column_identifier> alias, bool reversed) {
_defined_ordering.emplace_back(alias, reversed);
}
void validate() {
_properties->validate();
}
};
}
}
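
`cf_properties::get_reversable_type()` above reconciles a column's type with the declared clustering order: a column with no ordering entry is treated as not reversed. The following sketch mirrors that three-way branch with toy types. `toy_type`, `ordering` and `reconcile` are hypothetical, C++17 `std::optional` stands in for `std::experimental::optional`, and a boolean flag replaces the real `reversed_type_impl` wrapper.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type descriptor; `reversed` stands in for a reversed-type wrapper.
struct toy_type {
    std::string name;
    bool reversed = false;
};

// Column name paired with "is reversed"; insertion order is preserved.
using ordering = std::vector<std::pair<std::string, bool>>;

// Linear lookup; a disengaged optional means "no ordering was declared".
inline std::optional<bool> find_ordering_info(const ordering& o, const std::string& column) {
    for (const auto& entry : o) {
        if (entry.first == column) {
            return entry.second;
        }
    }
    return std::nullopt;
}

inline toy_type reconcile(const ordering& o, const std::string& column, toy_type type) {
    const bool is_reversed = find_ordering_info(o, column).value_or(false);
    if (!is_reversed && type.reversed) {
        type.reversed = false;  // strip the wrapper: order says ascending
        return type;
    }
    if (is_reversed && !type.reversed) {
        type.reversed = true;   // add the wrapper: order says descending
        return type;
    }
    return type;                // already consistent
}
```

Unlike the code it replaces in `get_type_and_remove()`, which returned the type unchanged when no ordering entry existed, this form also unwraps a type that arrives already reversed.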


@@ -207,7 +207,7 @@ cql3::statements::create_index_statement::announce_migration(distributed<service
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::create_index_statement::prepare(database& db) {
cql3::statements::create_index_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_index_statement>(*this));
}


@@ -87,7 +87,7 @@ public:
transport::event::schema_change::target_type::TABLE, keyspace(),
column_family());
}
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -124,7 +124,7 @@ shared_ptr<transport::event::schema_change> create_keyspace_statement::change_ev
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::create_keyspace_statement::prepare(database& db) {
cql3::statements::create_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_keyspace_statement>(*this));
}


@@ -84,7 +84,7 @@ public:
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -50,6 +50,7 @@
#include "cql3/statements/prepared_statement.hh"
#include "schema_builder.hh"
#include "service/storage_service.hh"
namespace cql3 {
@@ -64,12 +65,6 @@ create_table_statement::create_table_statement(::shared_ptr<cf_name> name,
, _properties{properties}
, _if_not_exists{if_not_exists}
{
if (!properties->has_property(cf_prop_defs::KW_COMPRESSION) && schema::DEFAULT_COMPRESSOR) {
std::map<sstring, sstring> compression = {
{ sstring(compression_parameters::SSTABLE_COMPRESSION), schema::DEFAULT_COMPRESSOR.value() },
};
properties->add_property(cf_prop_defs::KW_COMPRESSION, compression);
}
}
future<> create_table_statement::check_access(const service::client_state& state) {
@@ -157,7 +152,7 @@ void create_table_statement::add_column_metadata_from_aliases(schema_builder& bu
}
shared_ptr<prepared_statement>
create_table_statement::prepare(database& db) {
create_table_statement::prepare(database& db, cql_stats& stats) {
// Cannot happen; create_table_statement is never instantiated as a raw statement
// (instead we instantiate create_table_statement::raw_statement)
abort();
@@ -169,7 +164,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
, _if_not_exists{if_not_exists}
{ }
::shared_ptr<prepared_statement> create_table_statement::raw_statement::prepare(database& db) {
::shared_ptr<prepared_statement> create_table_statement::raw_statement::prepare(database& db, cql_stats& stats) {
// Column family name
const sstring& cf_name = _cf_name->get_column_family();
std::regex name_regex("\\w+");
@@ -188,17 +183,16 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
throw exceptions::invalid_request_exception(sprint("Multiple definition of identifier %s", (*i)->text()));
}
properties->validate();
_properties.validate();
auto stmt = ::make_shared<create_table_statement>(_cf_name, properties, _if_not_exists, _static_columns);
auto stmt = ::make_shared<create_table_statement>(_cf_name, _properties.properties(), _if_not_exists, _static_columns);
std::experimental::optional<std::map<bytes, data_type>> defined_multi_cell_collections;
for (auto&& entry : _definitions) {
::shared_ptr<column_identifier> id = entry.first;
::shared_ptr<cql3_type> pt = entry.second->prepare(db, keyspace());
// FIXME: remove this check once we support counters
if (pt->is_counter()) {
fail(unimplemented::cause::COUNTERS);
if (pt->is_counter() && !service::get_local_storage_service().cluster_supports_counters()) {
throw exceptions::invalid_request_exception("Counter support is not enabled");
}
if (pt->is_collection() && pt->get_type()->is_multi_cell()) {
if (!defined_multi_cell_collections) {
@@ -214,7 +208,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
throw exceptions::invalid_request_exception("Multiple PRIMARY KEYs specifed (exactly one required)");
}
stmt->_use_compact_storage = _use_compact_storage;
stmt->_use_compact_storage = _properties.use_compact_storage();
auto& key_aliases = _key_aliases[0];
std::vector<data_type> key_types;
@@ -233,7 +227,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
// Handle column aliases
if (_column_aliases.empty()) {
if (_use_compact_storage) {
if (_properties.use_compact_storage()) {
// There should remain some column definition since it is a non-composite "static" CF
if (stmt->_columns.empty()) {
throw exceptions::invalid_request_exception("No definition found that is not part of the PRIMARY KEY");
@@ -246,7 +240,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
} else {
// If we use compact storage and have only one alias, it is a
// standard "dynamic" CF, otherwise it's a composite
if (_use_compact_storage && _column_aliases.size() == 1) {
if (_properties.use_compact_storage() && _column_aliases.size() == 1) {
if (defined_multi_cell_collections) {
throw exceptions::invalid_request_exception("Collection types are not supported with COMPACT STORAGE");
}
@@ -274,7 +268,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
types.emplace_back(type);
}
if (_use_compact_storage) {
if (_properties.use_compact_storage()) {
if (defined_multi_cell_collections) {
throw exceptions::invalid_request_exception("Collection types are not supported with COMPACT STORAGE");
}
@@ -287,7 +281,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
if (!_static_columns.empty()) {
// Only CQL3 tables can have static columns
if (_use_compact_storage) {
if (_properties.use_compact_storage()) {
throw exceptions::invalid_request_exception("Static columns are not supported in COMPACT STORAGE tables");
}
// Static columns only make sense if we have at least one clustering column. Otherwise everything is static anyway
@@ -296,7 +290,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
}
}
if (_use_compact_storage && !stmt->_column_aliases.empty()) {
if (_properties.use_compact_storage() && !stmt->_column_aliases.empty()) {
if (stmt->_columns.empty()) {
#if 0
// The only value we'll insert will be the empty one, so the default validator doesn't matter
@@ -322,7 +316,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
} else {
// For compact, we are in the "static" case, so we need at least one column defined. For non-compact however, having
// just the PK is fine since we have CQL3 row marker.
if (_use_compact_storage && stmt->_columns.empty()) {
if (_properties.use_compact_storage() && stmt->_columns.empty()) {
throw exceptions::invalid_request_exception("COMPACT STORAGE with non-composite PRIMARY KEY require one column not part of the PRIMARY KEY, none given");
}
#if 0
@@ -335,18 +329,18 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
}
// If we give a clustering order, we must explicitly do so for all aliases and in the order of the PK
if (!_defined_ordering.empty()) {
if (_defined_ordering.size() > _column_aliases.size()) {
if (!_properties.defined_ordering().empty()) {
if (_properties.defined_ordering().size() > _column_aliases.size()) {
throw exceptions::invalid_request_exception("Only clustering key columns can be defined in CLUSTERING ORDER directive");
}
int i = 0;
for (auto& pair: _defined_ordering){
for (auto& pair: _properties.defined_ordering()){
auto& id = pair.first;
auto& c = _column_aliases.at(i);
if (!(*id == *c)) {
if (find_ordering_info(c)) {
if (_properties.find_ordering_info(c)) {
throw exceptions::invalid_request_exception(sprint("The order of columns in the CLUSTERING ORDER directive must be the one of the clustering key (%s must appear before %s)", c, id));
} else {
throw exceptions::invalid_request_exception(sprint("Missing CLUSTERING ORDER for column %s", c));
@@ -371,12 +365,7 @@ data_type create_table_statement::raw_statement::get_type_and_remove(column_map_
}
columns.erase(t);
auto is_reversed = find_ordering_info(t);
if (!is_reversed) {
return type;
} else {
return *is_reversed ? reversed_type_impl::get_instance(type) : type;
}
return _properties.get_reversable_type(t, type);
}
void create_table_statement::raw_statement::add_definition(::shared_ptr<column_identifier> def, ::shared_ptr<cql3_type::raw> type, bool is_static) {
@@ -395,14 +384,6 @@ void create_table_statement::raw_statement::add_column_alias(::shared_ptr<column
_column_aliases.emplace_back(alias);
}
void create_table_statement::raw_statement::set_ordering(::shared_ptr<column_identifier> alias, bool reversed) {
_defined_ordering.emplace_back(alias, reversed);
}
void create_table_statement::raw_statement::set_compact_storage() {
_use_compact_storage = true;
}
}
}

View File

@@ -43,6 +43,7 @@
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/cf_properties.hh"
#include "cql3/statements/raw/cf_statement.hh"
#include "cql3/cql3_type.hh"
@@ -103,7 +104,7 @@ public:
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
schema_ptr get_cf_meta_data();
@@ -125,30 +126,22 @@ private:
shared_ptr_value_hash<column_identifier>,
shared_ptr_equal_by_value<column_identifier>>;
defs_type _definitions;
public:
const ::shared_ptr<cf_prop_defs> properties = ::make_shared<cf_prop_defs>();
private:
std::vector<std::vector<::shared_ptr<column_identifier>>> _key_aliases;
std::vector<::shared_ptr<column_identifier>> _column_aliases;
std::vector<std::pair<::shared_ptr<column_identifier>, bool>> _defined_ordering; // Insertion ordering is important
std::experimental::optional<bool> find_ordering_info(::shared_ptr<column_identifier> type) {
for (auto& t: _defined_ordering) {
if (*(t.first) == *type) {
return t.second;
}
}
return {};
}
create_table_statement::column_set_type _static_columns;
bool _use_compact_storage = false;
std::multiset<::shared_ptr<column_identifier>,
indirect_less<::shared_ptr<column_identifier>, column_identifier::text_comparator>> _defined_names;
bool _if_not_exists;
cf_properties _properties;
public:
raw_statement(::shared_ptr<cf_name> name, bool if_not_exists);
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
cf_properties& properties() {
return _properties;
}
data_type get_type_and_remove(column_map_type& columns, ::shared_ptr<column_identifier> t);
@@ -157,10 +150,6 @@ public:
void add_key_aliases(const std::vector<::shared_ptr<column_identifier>> aliases);
void add_column_alias(::shared_ptr<column_identifier> alias);
void set_ordering(::shared_ptr<column_identifier> alias, bool reversed);
void set_compact_storage();
};
}

View File

@@ -157,7 +157,7 @@ future<bool> create_type_statement::announce_migration(distributed<service::stor
}
shared_ptr<cql3::statements::prepared_statement>
create_type_statement::prepare(database& db) {
create_type_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_type_statement>(*this));
}

View File

@@ -69,7 +69,7 @@ public:
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
static void check_for_duplicate_names(user_type type);
private:

View File

@@ -0,0 +1,345 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <unordered_set>
#include <vector>
#include <boost/range/iterator_range.hpp>
#include <boost/range/join.hpp>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include "cql3/column_identifier.hh"
#include "cql3/restrictions/statement_restrictions.hh"
#include "cql3/statements/create_view_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/statements/select_statement.hh"
#include "cql3/statements/raw/select_statement.hh"
#include "cql3/selection/selectable.hh"
#include "cql3/selection/selectable_with_field_selection.hh"
#include "cql3/selection/selection.hh"
#include "cql3/selection/writetime_or_ttl.hh"
#include "cql3/util.hh"
#include "schema_builder.hh"
#include "service/storage_proxy.hh"
#include "validation.hh"
#include "db/config.hh"
#include "service/storage_service.hh"
namespace cql3 {
namespace statements {
create_view_statement::create_view_statement(
::shared_ptr<cf_name> view_name,
::shared_ptr<cf_name> base_name,
std::vector<::shared_ptr<selection::raw_selector>> select_clause,
std::vector<::shared_ptr<relation>> where_clause,
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys,
std::vector<::shared_ptr<cql3::column_identifier::raw>> clustering_keys,
bool if_not_exists)
: schema_altering_statement{view_name}
, _base_name{base_name}
, _select_clause{select_clause}
, _where_clause{where_clause}
, _partition_keys{partition_keys}
, _clustering_keys{clustering_keys}
, _if_not_exists{if_not_exists}
{
service::get_local_storage_proxy().get_db().local().get_config().check_experimental("Creating materialized views");
if (!service::get_local_storage_service().cluster_supports_materialized_views()) {
throw exceptions::invalid_request_exception("Can't create materialized views until the whole cluster has been upgraded");
}
}
future<> create_view_statement::check_access(const service::client_state& state) {
return state.has_column_family_access(keyspace(), _base_name->get_column_family(), auth::permission::ALTER);
}
void create_view_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state) {
// validated in announce_migration()
}
static const column_definition* get_column_definition(schema_ptr schema, column_identifier::raw& identifier) {
auto prepared = identifier.prepare(schema);
assert(dynamic_pointer_cast<column_identifier>(prepared));
auto id = static_pointer_cast<column_identifier>(prepared);
return schema->get_column_definition(id->name());
}
static bool validate_primary_key(
schema_ptr schema,
const column_definition* def,
const std::unordered_set<const column_definition*>& base_pk,
bool has_non_pk_column,
const restrictions::statement_restrictions& restrictions) {
if (def->type->is_multi_cell()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot use MultiCell column '%s' in PRIMARY KEY of materialized view", def->name_as_text()));
}
if (def->is_static()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot use Static column '%s' in PRIMARY KEY of materialized view", def->name_as_text()));
}
if (base_pk.find(def) == base_pk.end()) {
if (has_non_pk_column) {
throw exceptions::invalid_request_exception(sprint(
"Cannot include more than one non-primary key column '%s' in materialized view primary key", def->name_as_text()));
}
return true;
}
// We don't need to include the "IS NOT NULL" filter on a non-composite partition key
// because we will never allow a single partition key to be NULL
if (schema->partition_key_columns().size() > 1 && !restrictions.is_restricted(def)) {
throw exceptions::invalid_request_exception(sprint(
"Primary key column '%s' is required to be filtered by 'IS NOT NULL'", def->name_as_text()));
}
return false;
}
future<bool> create_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) {
// We need to make sure that:
// - primary key includes all columns in base table's primary key
// - make sure that the select statement does not have anything other than columns
// and their names match the base table's names
// - make sure that primary key does not include any collections
// - make sure there is no where clause in the select statement
// - make sure there is not currently a table or view
// - make sure base_table gc_grace_seconds > 0
_properties.validate();
if (_properties.use_compact_storage()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot use 'COMPACT STORAGE' when defining a materialized view"));
}
// View and base tables must be in the same keyspace, to ensure that RF
// is the same (because we assign a view replica to each base replica).
// If a keyspace was not specified for the base table name, it is assumed
// it is in the same keyspace as the view table being created (which
// itself might be the current USEd keyspace, or explicitly specified).
if (_base_name->get_keyspace().empty()) {
_base_name->set_keyspace(keyspace(), true);
}
if (_base_name->get_keyspace() != keyspace()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot create a materialized view on a table in a separate keyspace ('%s' != '%s')",
_base_name->get_keyspace(), keyspace()));
}
auto&& db = proxy.local().get_db().local();
schema_ptr schema = validation::validate_column_family(db, _base_name->get_keyspace(), _base_name->get_column_family());
if (schema->is_counter()) {
throw exceptions::invalid_request_exception(sprint(
"Materialized views are not supported on counter tables"));
}
if (schema->is_view()) {
throw exceptions::invalid_request_exception(sprint(
"Materialized views cannot be created against other materialized views"));
}
if (schema->gc_grace_seconds().count() == 0) {
throw exceptions::invalid_request_exception(sprint(
"Cannot create materialized view '%s' for base table "
"'%s' with gc_grace_seconds of 0, since this value is "
"used to TTL undelivered updates. Setting gc_grace_seconds "
"too low might cause undelivered updates to expire "
"before being replayed.", column_family(), _base_name->get_column_family()));
}
// Gather all included columns, as specified by the select clause
auto included = boost::copy_range<std::unordered_set<const column_definition*>>(_select_clause | boost::adaptors::transformed([&](auto&& selector) {
if (selector->alias) {
throw exceptions::invalid_request_exception(sprint(
"Cannot use alias when defining a materialized view"));
}
auto selectable = selector->selectable_;
if (dynamic_pointer_cast<selection::selectable::with_field_selection::raw>(selectable)) {
throw exceptions::invalid_request_exception(sprint(
"Cannot select out a part of type when defining a materialized view"));
}
if (dynamic_pointer_cast<selection::selectable::with_function::raw>(selectable)) {
throw exceptions::invalid_request_exception(sprint(
"Cannot use function when defining a materialized view"));
}
if (dynamic_pointer_cast<selection::selectable::writetime_or_ttl::raw>(selectable)) {
throw exceptions::invalid_request_exception(sprint(
"Cannot use function when defining a materialized view"));
}
assert(dynamic_pointer_cast<column_identifier::raw>(selectable));
auto identifier = static_pointer_cast<column_identifier::raw>(selectable);
auto* def = get_column_definition(schema, *identifier);
if (!def) {
throw exceptions::invalid_request_exception(sprint(
"Unknown column name detected in CREATE MATERIALIZED VIEW statement : %s", identifier));
}
return def;
}));
if (!get_bound_variables()->empty()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot use query parameters in CREATE MATERIALIZED VIEW statements"));
}
auto parameters = ::make_shared<raw::select_statement::parameters>(raw::select_statement::parameters::orderings_type(), false, true);
raw::select_statement raw_select(_base_name, std::move(parameters), _select_clause, _where_clause, nullptr);
raw_select.prepare_keyspace(keyspace());
raw_select.set_bound_variables({});
cql_stats ignored;
auto prepared = raw_select.prepare(db, ignored, true);
auto restrictions = static_pointer_cast<statements::select_statement>(prepared->statement)->get_restrictions();
auto base_primary_key_cols = boost::copy_range<std::unordered_set<const column_definition*>>(
boost::range::join(schema->partition_key_columns(), schema->clustering_key_columns())
| boost::adaptors::transformed([](auto&& def) { return &def; }));
if (_partition_keys.empty()) {
throw exceptions::invalid_request_exception(sprint("Must select at least a column for a Materialized View"));
}
if (_clustering_keys.empty()) {
throw exceptions::invalid_request_exception(sprint("No columns are defined for Materialized View other than primary key"));
}
// Validate the primary key clause, ensuring only one non-PK base column is used in the view's PK.
bool has_non_pk_column = false;
std::unordered_set<const column_definition*> target_primary_keys;
std::vector<const column_definition*> target_partition_keys;
std::vector<const column_definition*> target_clustering_keys;
auto validate_pk = [&] (const std::vector<::shared_ptr<cql3::column_identifier::raw>>& keys, std::vector<const column_definition*>& target_keys) mutable {
for (auto&& identifier : keys) {
auto* def = get_column_definition(schema, *identifier);
if (!def) {
throw exceptions::invalid_request_exception(sprint(
"Unknown column name detected in CREATE MATERIALIZED VIEW statement : %s", identifier));
}
if (!target_primary_keys.insert(def).second) {
throw exceptions::invalid_request_exception(sprint(
"Duplicate entry found in PRIMARY KEY: %s", identifier));
}
target_keys.push_back(def);
has_non_pk_column |= validate_primary_key(schema, def, base_primary_key_cols, has_non_pk_column, *restrictions);
}
};
validate_pk(_partition_keys, target_partition_keys);
validate_pk(_clustering_keys, target_clustering_keys);
std::vector<const column_definition*> missing_pk_columns;
std::vector<const column_definition*> target_non_pk_columns;
// We need to include all of the primary key columns from the base table in order to make sure that we do not
// overwrite values in the view. We cannot support "collapsing" the base table into a smaller number of rows in
// the view because if we need to generate a tombstone, we have no way of knowing which value is currently being
// used in the view and whether or not to generate a tombstone. In order to not surprise our users, we require
// that they include all of the columns. We provide them with a list of all of the columns left to include.
for (auto* def : schema->all_columns() | boost::adaptors::map_values) {
bool included_def = included.empty() || included.find(def) != included.end();
if (included_def && def->is_static()) {
throw exceptions::invalid_request_exception(sprint(
"Unable to include static column '%s' which would be included by Materialized View SELECT * statement", *def));
}
bool def_in_target_pk = target_primary_keys.find(def) != target_primary_keys.end();
if (included_def && !def_in_target_pk) {
target_non_pk_columns.push_back(def);
} else if (def->is_primary_key() && !def_in_target_pk) {
missing_pk_columns.push_back(def);
}
}
if (!missing_pk_columns.empty()) {
auto column_names = ::join(", ", missing_pk_columns | boost::adaptors::transformed(std::mem_fn(&column_definition::name)));
throw exceptions::invalid_request_exception(sprint(
"Cannot create Materialized View %s without primary key columns from base %s (%s)",
column_family(), _base_name->get_column_family(), column_names));
}
schema_builder builder{keyspace(), column_family()};
auto add_columns = [this, &builder] (std::vector<const column_definition*>& defs, column_kind kind) mutable {
for (auto* def : defs) {
auto&& type = _properties.get_reversable_type(def->column_specification->name, def->type);
builder.with_column(def->name(), type, kind);
}
};
add_columns(target_partition_keys, column_kind::partition_key);
add_columns(target_clustering_keys, column_kind::clustering_key);
add_columns(target_non_pk_columns, column_kind::regular_column);
_properties.properties()->apply_to_builder(builder);
auto where_clause_text = util::relations_to_where_clause(_where_clause);
builder.with_view_info(schema->id(), schema->cf_name(), included.empty(), std::move(where_clause_text));
return make_ready_future<>().then([definition = view_ptr(builder.build()), is_local_only]() mutable {
return service::get_local_migration_manager().announce_new_view(definition, is_local_only);
}).then_wrapped([this] (auto&& f) {
try {
f.get();
return true;
} catch (const exceptions::already_exists_exception& e) {
if (_if_not_exists) {
return false;
}
throw;
}
});
}
shared_ptr<transport::event::schema_change> create_view_statement::change_event() {
return make_shared<transport::event::schema_change>(transport::event::schema_change::change_type::CREATED, transport::event::schema_change::target_type::TABLE, keyspace(), column_family());
}
shared_ptr<cql3::statements::prepared_statement>
create_view_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_view_statement>(*this));
}
}
}
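
The primary key rules that announce_migration() enforces above (no duplicate key columns, at most one key column taken from outside the base table's primary key, and every base primary key column present in the view's key) can be sketched in isolation. This is only an illustration under simplifying assumptions: columns are plain strings instead of column_definition pointers, and view_pk_check and check_view_primary_key are hypothetical names that do not exist in Scylla.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical result type for this sketch; not part of Scylla.
struct view_pk_check {
    bool ok;
    std::vector<std::string> missing_base_pk; // base PK columns absent from the view PK
    std::string error;                        // empty when ok is true
};

// Simplified stand-in for the key validation in announce_migration().
// base_pk: primary key columns of the base table.
// view_pk: partition key columns followed by clustering key columns of the view.
view_pk_check check_view_primary_key(const std::set<std::string>& base_pk,
                                     const std::vector<std::string>& view_pk) {
    std::set<std::string> seen;
    bool has_non_pk_column = false;
    for (const auto& col : view_pk) {
        // A column may appear only once in the view's primary key.
        if (!seen.insert(col).second) {
            return {false, {}, "duplicate entry in PRIMARY KEY: " + col};
        }
        // Only one column from outside the base primary key is allowed.
        if (base_pk.count(col) == 0) {
            if (has_non_pk_column) {
                return {false, {}, "more than one non-primary key column: " + col};
            }
            has_non_pk_column = true;
        }
    }
    // Every base primary key column must be part of the view's primary key.
    view_pk_check result{true, {}, ""};
    for (const auto& col : base_pk) {
        if (seen.count(col) == 0) {
            result.missing_base_pk.push_back(col);
        }
    }
    if (!result.missing_base_pk.empty()) {
        result.ok = false;
        result.error = "missing base primary key columns";
    }
    return result;
}
```

For a base table keyed on (pk, ck), a view keyed on (v, pk, ck) passes this sketch, while a view keyed on (v, pk) is rejected with ck reported as missing. The static, multi-cell and IS NOT NULL checks from validate_primary_key() are deliberately left out.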

View File

@@ -0,0 +1,79 @@
/*
* This file is part of Scylla.
* Copyright (C) 2016 ScyllaDB
*
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/cf_properties.hh"
#include "cql3/cql3_type.hh"
#include "cql3/selection/raw_selector.hh"
#include "cql3/relation.hh"
#include "cql3/cf_name.hh"
#include "service/migration_manager.hh"
#include "schema.hh"
#include "core/shared_ptr.hh"
#include <utility>
#include <vector>
#include <experimental/optional>
namespace cql3 {
namespace statements {
/** A <code>CREATE MATERIALIZED VIEW</code> parsed from a CQL query statement. */
class create_view_statement : public schema_altering_statement {
private:
::shared_ptr<cf_name> _base_name;
std::vector<::shared_ptr<selection::raw_selector>> _select_clause;
std::vector<::shared_ptr<relation>> _where_clause;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _clustering_keys;
cf_properties _properties;
bool _if_not_exists;
public:
create_view_statement(
::shared_ptr<cf_name> view_name,
::shared_ptr<cf_name> base_name,
std::vector<::shared_ptr<selection::raw_selector>> select_clause,
std::vector<::shared_ptr<relation>> where_clause,
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys,
std::vector<::shared_ptr<cql3::column_identifier::raw>> clustering_keys,
bool if_not_exists);
auto& properties() {
return _properties;
}
// Functions we need to override to subclass schema_altering_statement
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
// FIXME: continue here. See create_table_statement.hh and CreateViewStatement.java
};
}
}

View File

@@ -46,8 +46,8 @@ namespace cql3 {
namespace statements {
delete_statement::delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs)}
delete_statement::delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, cql_stats& stats)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs), &stats.deletes}
{ }
bool delete_statement::require_full_clustering_key() const {
@@ -80,10 +80,10 @@ void delete_statement::add_update_for_key(mutation& m, const exploded_clustering
namespace raw {
::shared_ptr<cql3::statements::modification_statement>
delete_statement::prepare_internal(database& db, schema_ptr schema, ::shared_ptr<variable_specifications> bound_names,
std::unique_ptr<attributes> attrs) {
delete_statement::prepare_internal(database& db, schema_ptr schema, shared_ptr<variable_specifications> bound_names,
std::unique_ptr<attributes> attrs, cql_stats& stats) {
using statement_type = cql3::statements::modification_statement::statement_type;
auto stmt = ::make_shared<cql3::statements::delete_statement>(statement_type::DELETE, bound_names->size(), schema, std::move(attrs));
auto stmt = ::make_shared<cql3::statements::delete_statement>(statement_type::DELETE, bound_names->size(), schema, std::move(attrs), stats);
for (auto&& deletion : _deletions) {
auto&& id = deletion->affected_column()->prepare_column_identifier(schema);

View File

@@ -56,7 +56,7 @@ namespace statements {
*/
class delete_statement : public modification_statement {
public:
delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs);
delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, cql_stats& stats);
virtual bool require_full_clustering_key() const override;

View File

@@ -99,7 +99,7 @@ shared_ptr<transport::event::schema_change> drop_keyspace_statement::change_even
}
shared_ptr<cql3::statements::prepared_statement>
drop_keyspace_statement::prepare(database& db) {
drop_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_keyspace_statement>(*this));
}

View File

@@ -63,7 +63,7 @@ public:
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -100,7 +100,7 @@ shared_ptr<transport::event::schema_change> drop_table_statement::change_event()
}
shared_ptr<cql3::statements::prepared_statement>
drop_table_statement::prepare(database& db) {
drop_table_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_table_statement>(*this));
}

View File

@@ -62,7 +62,7 @@ public:
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -79,6 +79,57 @@ void drop_type_statement::validate(distributed<service::storage_proxy>& proxy, c
throw exceptions::invalid_request_exception(sprint("No user type named %s exists.", _name.to_string()));
}
}
// We don't want to drop a type unless it's not used anymore (mainly because
// if someone drops a type and recreates one with the same name but different
// definition with the previous name still in use, things can get messy).
// We have two places to check: 1) other user type that can nest the one
// we drop and 2) existing tables referencing the type (maybe in a nested
// way).
// This code is moved from schema_keyspace (akin to origin) because we cannot
// delay this check until after we've applied the mutations. If a type or
// table references the type we're dropping, we will a.) get exceptions parsing
// (can be translated to invalid_request, but...) and more importantly b.)
// we will leave those types/tables in a broken state.
// We managed to get through this before because we neither enforced hard
// cross reference between types when loading them, nor did we in fact
// probably ever run the scenario of dropping a referenced type and then
// actually using the referee.
//
// Now, this has a giant flaw. We are susceptible to race conditions here,
// since we could have a drop at the same time as a create type that references
// the dropped one, but we complete the check before the create is done,
// yet apply the drop mutations after -> inconsistent data!
// This problem is the same in origin, and I see no good way around it
// as long as the atomicity of schema modifications is based on the
// actual application of mutations, because unlike other drops, this one isn't
// benevolent.
// I guess this is one case where users need to beware and not mess with types
// concurrently!
auto&& type = old->second;
auto&& keyspace = type->_keyspace;
auto&& name = type->_name;
for (auto&& ut : all_types | boost::adaptors::map_values) {
if (ut->_keyspace == keyspace && ut->_name == name) {
continue;
}
if (ut->references_user_type(keyspace, name)) {
throw exceptions::invalid_request_exception(sprint("Cannot drop user type %s.%s as it is still used by user type %s", keyspace, type->get_name_as_string(), ut->get_name_as_string()));
}
}
for (auto&& cfm : ks.metadata()->cf_meta_data() | boost::adaptors::map_values) {
for (auto&& col : cfm->all_columns()) {
if (col.second->type->references_user_type(keyspace, name)) {
throw exceptions::invalid_request_exception(sprint("Cannot drop user type %s.%s as it is still used by table %s.%s", keyspace, type->get_name_as_string(), cfm->ks_name(), cfm->cf_name()));
}
}
}
} catch (no_such_keyspace& e) {
throw exceptions::invalid_request_exception(sprint("Cannot drop type in unknown keyspace %s", keyspace()));
}
@@ -118,7 +169,7 @@ future<bool> drop_type_statement::announce_migration(distributed<service::storag
}
shared_ptr<cql3::statements::prepared_statement>
drop_type_statement::prepare(database& db) {
drop_type_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_type_statement>(*this));
}
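
The two-place check described in the drop_type_statement::validate() comment (other user types first, then table columns) can be sketched outside the real schema classes. This is a rough illustration only: type_sketch and why_cannot_drop are hypothetical names, types are identified by plain strings, and only direct references are considered, whereas the real references_user_type() also handles nested ones. The race condition mentioned in the comment is not addressed here either.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified user type: a name plus the names of the types its fields use.
struct type_sketch {
    std::string name;
    std::vector<std::string> referenced_types;
};

// True if `t` directly uses `target` as a field type.
bool references(const type_sketch& t, const std::string& target) {
    return std::find(t.referenced_types.begin(), t.referenced_types.end(), target)
            != t.referenced_types.end();
}

// Returns an empty string when dropping `target` looks safe, otherwise a short
// reason naming the first user found. Other user types are checked before tables,
// mirroring the order of the two loops in validate().
std::string why_cannot_drop(const std::string& target,
                            const std::vector<type_sketch>& all_types,
                            const std::map<std::string, std::vector<std::string>>& table_column_types) {
    for (const auto& ut : all_types) {
        if (ut.name == target) {
            continue; // a type does not block its own removal
        }
        if (references(ut, target)) {
            return "still used by user type " + ut.name;
        }
    }
    for (const auto& table : table_column_types) {
        for (const auto& col_type : table.second) {
            if (col_type == target) {
                return "still used by table " + table.first;
            }
        }
    }
    return "";
}
```

With a type person that has a field of type address, and a table whose column is of type person, the sketch refuses to drop either type and allows dropping a type nothing refers to.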

Some files were not shown because too many files have changed in this diff