Compare commits

...

1039 Commits

Author SHA1 Message Date
Avi Kivity
bd807660e6 Update seastar submodule
* seastar 63079b7...725269f (1):
  > build: work around ragel 7 generated code bug
2017-08-09 14:22:28 +03:00
Avi Kivity
49894b2504 Update seastar submodule
* seastar 4a31f27...63079b7 (2):
  > build: export full cflags in pkgconfig file
  > build: disable -Wattributes when gcc -fvisibility=hidden bug strikes

Allows building with gcc 6.4.
2017-08-09 13:08:55 +03:00
Avi Kivity
79990908f4 Revert "sstable: close sstable_writer's file if writing of sstable fails."
This reverts commit 86cecd0d7d. Was already there as
3edcee4821.
2017-07-19 17:47:23 +03:00
Gleb Natapov
86cecd0d7d sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying file's object causes
crashes.

[tgrabiec: fixed typo]

Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
2017-07-19 17:28:42 +03:00
Raphael S. Carvalho
dd7052023b tests/sstable_test: fix compaction_manager_test
after 'compaction: make major compaction go through compaction manager',
the test fails because task is preempted in debug mode before it reaches
intruction to increase stat.

Backport from 8dfb5f9c33
Included single-line change in compaction_manager_test from
8b0e358d73 that fixes release mode too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170714194946.12467-1-raphaelsc@scylladb.com>
2017-07-15 08:05:35 +03:00
Pekka Enberg
f8b81c2565 release: prepare for 1.6.6 2017-07-14 12:12:32 +03:00
Duarte Nunes
cb55afd824 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162400.15968-1-duarte@scylladb.com>
2017-07-14 10:12:01 +02:00
Raphael S. Carvalho
d8e86faf95 compaction: make major compaction go through compaction manager
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor and major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.

Fixes #1156.

NOTE: resolved conflict because in this version manager doesn't
stop transportation services for io errors.

(cherry picked from commit 3286f7aaa6)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170614211923.9955-2-raphaelsc@scylladb.com>
2017-06-15 12:42:48 +03:00
Raphael S. Carvalho
159fa22082 database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.

Fixes #192.

(cherry picked from commit 7deeffc953)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170614211923.9955-1-raphaelsc@scylladb.com>
2017-06-15 12:42:40 +03:00
Calle Wilund
8ef146d6ca commitlog_test: Fix test_commitlog_delete_when_over_disk_limit
Test should
a.) Wait for the flush semaphore
b.) Only compare segement sets between start and end, not start,
    end and inbetwen. I.e. the test sort of assumed we started
    with < 2 (or so) segments. Not always the case (timing)

Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 0c598e5645)
2017-06-13 19:53:32 +03:00
Paweł Dziepak
1b9e0766b3 commitlog: avoid copying column_mapping
It is safe to copy column_mapping accros shards. Such guarantee comes at
the cost of performance.

This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.

Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
          before       after      diff
write   79434.35    89247.54    +12.3%

(cherry picked from commit 374c8a56ac)

Also: Fixes #2468.
2017-06-11 15:56:48 +03:00
Paweł Dziepak
67f7932a35 idl: fix generated writers when member functions are used
When using member name in an idetifer of generated class or method
idl compiler should strip the trailing '()'.

(cherry picked from commit 4df4994b71)

(part of #2468)
2017-06-11 15:56:45 +03:00
Paweł Dziepak
991d6bdb77 idl: add start_frame() overload for seastar::simple_output_stream
(cherry picked from commit 018d16d315)

(part of #2468)
2017-06-11 15:56:43 +03:00
Pekka Enberg
9699319194 release: prepare for 1.6.5 2017-06-09 07:53:34 +03:00
Raphael S. Carvalho
788b7bfb46 db: fix computation of live disk usage stat after compaction
sstable::data_size() is used by rebuild_statistics() which only
returns uncompressed data size, and the function called by it
expects actual disk space used by all components.
Boot uses add_sstable() which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
r__s() too.

Fixes #1592

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
(cherry picked from commit 3b5ad23532)
2017-05-28 11:32:01 +03:00
Raphael S. Carvalho
535e9ff95f db: remove partial sstable created by memtable flush which failed
partial sstable files aren't being removed after each failed attempt
to flush memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, column family may be left with a huge amount of partial
files which will overwhelm subsequent boot when removing temporary
TOC. In the past, it led to OOM because removal of temporary TOC
took place in parallel.

Fixes #2407.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
(cherry picked from commit b7e1575ad4)
2017-05-25 11:50:32 +03:00
Asias He
1648301411 repair: Fix partition estimation
We estimate number of partitions for a given range of a column familiy
and split the range into sub ranges contains fewer partitions as a
checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
(cherry picked from commit 66e3b73b9c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

Conflicts:
	repair/repair.cc

Glauber: conflict is due to the fact that dht::token_range is still called nonwrapping_range
Message-Id: <20170522015356.4236-1-glauber@scylladb.com>
2017-05-22 08:31:43 +03:00
Pekka Enberg
43a62ab5e1 Revert "repair: Fix partition estimation"
This reverts commit 54db1b5452. Glauber writes:

 "I forgot one git ammend, as it seems. The token type had to be changed
  in two places for it to compile, and I ended up including just one. I
  will send another fixed version soon."
2017-05-22 08:31:12 +03:00
Tomasz Grabiec
42a4fbe528 row_cache: Fix undefined behavior in read_wide()
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.

Fixes #2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0351ab8bc6)
2017-05-21 19:14:23 +03:00
Gleb Natapov
38e9683753 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Parallel
removing of a lot of sstables may start a lot of threads each of which
is taking 128k for its stack. There is no much benefit in running
deletion in parallel anyway, so fix it by deleting sstables sequentially.

Fixes #2384.

Message-Id: <20170517111722.GY3874@scylladb.com>
2017-05-21 17:12:19 +03:00
Asias He
54db1b5452 repair: Fix partition estimation
We estimate number of partitions for a given range of a column familiy
and split the range into sub ranges contains fewer partitions as a
checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
(cherry picked from commit 66e3b73b9c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

Conflicts:
	repair/repair.cc

Glauber: conflict is due to the fact that dht::token_range is still called nonwrapping_range
2017-05-21 17:04:09 +03:00
Tomasz Grabiec
f8f1d1d300 Update seastar submodule head
* dist/ami/files/scylla-ami d5a4397...407e8f3 (4):
  > scylla_create_devices: check block device is exists
  > Rewrite disk discovery to handle EBS and NVMEs.
  > add --developer-mode option
  > trivial cleanup: replace tab in indent

* seastar 5810d50...4a31f27 (6):
  > tls: make shutdown/close do "clean" handshake shutdown in background
  > tls: Make sink/source (i.e. streams) first class channel owners
  > tls.cc: Fix shutdown_input/output to conform with expected socket behaviour
  > net/*: qualify net::* refs better -> ::net::* etc
  > native-stack: Make sink/source (i.e. streams) first class channel owners
  > posix-stack: Make sink/source (i.e. streams) first class channel owners
2017-05-19 14:19:40 +02:00
Pekka Enberg
c257850e25 Merge "gossip mark alive fixes" from Asias
"This series fixes the user after free issue in gossip and elimates the
duplicated / unnecessary mark alive operations.

Fixes #2341"

* tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev:
  gossip: Ignore callbacks and mark alive operation in shadow round
  gossip: Ingore the duplicated mark alive operation
  gossip: Fix user after free in mark_alive

(cherry picked from commit 1e04731fa0)
2017-05-09 01:57:46 +03:00
Pekka Enberg
e8e5c88a02 Update seastar submodule
* seastar 8e54a9b...5810d50 (1):
  > scripts: posix_net_conf.sh: get rid of funny syntax

Fixes #2269
2017-05-03 10:34:50 +03:00
Avi Kivity
4f4edec8e1 Merge "] Fix problems with slicing using sstable's promoted index" from Tomasz
"Fixes #2327.
Fixes #2326."

* 'tgrabiec/fix-promoted-index-parsing-1.7' of github.com:cloudius-systems/seastar-dev:
  sstables: Fix incorrect parsing of cell names in promoted index
  sstables: Fix find_disk_ranges() to not miss relevant range tombstones

(cherry picked from commit ea0591ad3d)
2017-04-30 17:04:21 +03:00
Tomasz Grabiec
65862c4221 sstables: Fix usage of wrong comparator in find_disk_ranges()
This made a difference if clustering restriction bounds were not full
keys but prefixes.

Fixes #2272.

Message-Id: <1493058357-24156-1-git-send-email-tgrabiec@scylladb.com>
2017-04-24 21:56:33 +03:00
Pekka Enberg
7672c59873 release: prepare for 1.6.4 2017-04-24 19:39:09 +03:00
Duarte Nunes
19212a059f alter_type_statement: Fix signed to unsigned conversion
This could allow us to alter a non-existing field of an UDT.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170419114254.5582-1-duarte@scylladb.com>
(cherry picked from commit e06bafdc6c)
2017-04-19 14:48:38 +03:00
Raphael S. Carvalho
bbbb4dffbd partitioned_sstable_set: fix quadratic space complexity
streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.

Fixes #2287.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
(cherry picked from commit 11b74050a1)
2017-04-18 13:14:13 +03:00
Asias He
2f0970e83c gossip: Fix possible use-after-free of entry in endpoint_state_map
We take a reference of endpoint_state entry in endpoint_state_map. We
access it again after code which defers, the reference can be invalid
after the defer if someone deletes the entry during the defer.

Fix this by checking take the reference again after the defering code.

I also audited the code to remove unsafe reference to endpoint_state_map entry
as much as possible.

Fixes the following SIGSEGV:

Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout
0 --default-log-level info --'.
Program terminated with signal SIGSEGV, Segmentation fault.
(this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127
127     in /usr/include/c++/5/bits/stl_pair.h
[Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))]

Fixes #2271

Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>
(cherry picked from commit d27b47595b)
2017-04-13 13:18:54 +03:00
Pekka Enberg
5ab8601f64 release: prepare for 1.6.3 2017-04-04 14:10:48 +03:00
Takuya ASADA
993d216f23 dist/debian/debian/scylla-server.upstart: export SCYLLA_CONF, SCYLLA_HOME
We are sourcing sysconfig file on upstart, but forgot to load them as
environment variables.
So export them.

Fixes #2236

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491209505-32293-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit b087616a6c)
2017-04-04 11:00:17 +03:00
Calle Wilund
630425f513 messaging_service: Move log printout to actual listen start
Fixes  #1845
Log printout was before we actually had evaluated endpoint
to create, thus never included SSL info.
Message-Id: <1487766738-27797-1-git-send-email-calle@scylladb.com>

(cherry picked from commit d5f57bd047)
(cherry picked from commit 8c0488bce9)
2017-03-30 18:29:32 +03:00
Calle Wilund
1626c9ce5e commitlog/replayer: Bugfix: minimum rp broken, and cl reader offset too
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as sstable at all, we must tag it as zero-rp, and
make whole shard for it start at same zero.

This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.

Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.

I.e. three fixes:
* Reinstate min position guarding for unencoutered CF:s
* Fix stream creating in CL reader
* Fix segment map iterator use.

v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>

(cherry picked from commit b12b65db92)
2017-03-28 11:16:17 +02:00
Calle Wilund
bc6def4a0f commitlog_replayer: Do proper const-loopup of min positions for shards
Fixes #2173

Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.

Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.

Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
(cherry picked from commit c3a510a08d)
2017-03-22 18:46:17 +02:00
Vlad Zolotarov
1e77bcd60f Don't report a Tracing session ID unless the current query had a Tracing bit in its flags
Although the current master's behaviour is legal it's suboptimal and some Clients are sensitive to that.
Let's fix that.

Fixes #2179

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490115157-4657-1-git-send-email-vladz@scylladb.com>
2017-03-22 14:56:20 +02:00
Pekka Enberg
f9bcedfbf7 dist/docker: Expose Prometheus port by default
This patch exposes Scylla's Prometheus port by default. You can now use
the Scylla Monitoring project with the Docker image:

  https://github.com/scylladb/scylla-grafana-monitoring

To configure the IP addresses, use the 'docker inspect' command to
determine Scylla's IP address (assuming your running container is called
'some-scylla'):

  docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla

and then use that IP address in the prometheus/scylla_servers.yml
configuration file.

Fixes #1827

Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 85a127bc78)
2017-03-20 15:30:28 +02:00
Calle Wilund
c814cb3433 commitlog_replayer: Make replay parallel per shard
Fixes #2098

Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.

v2:
* Fixed whitespace errors

Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 078589c508)
2017-03-20 12:50:43 +02:00
Amos Kong
62143d26b4 scylla_setup: match '-p' option of lsblk with strict pattern
On Ubuntu 14.04, the lsblk doesn't have '-p' option, but
`scylla_setup` try to get block list by `lsblk -pnr` and
trigger error.

Current simple pattern will match all help content, it might
match wrong options.
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p
   -m, --perms          output info about permissions
   -P, --pairs          use key="value" output format

Let's use strict pattern to only match option at the head. Example:
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D'
   -D, --discard        print discard capabilities

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>
(cherry picked from commit 468df7dd5f)
2017-03-20 11:16:21 +02:00
Amnon Heiman
7a1ceadec4 storage_proxy: metrics should have unique name
Metrics should have their unique name. This patch changes
throttled_writes of the queu lenght to current_throttled_writes.

Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.

This could be related to scylladb/seastar#250

Fixes #2163.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
(cherry picked from commit 295a981c61)
2017-03-19 18:11:58 +02:00
Pekka Enberg
d41ba3ba78 release: prepare for 1.6.2 2017-03-10 11:24:19 +02:00
Asias He
0e11c6218a repair: Fix midpoint is not contained in the split range assertion in split_and_add
We have:

  auto halves = range.split(midpoint, dht::token_comparator());

We saw a case where midpoint == range.start, as a result, range.split
will assert becasue the range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range - hence the
assertion failure.

Fixes #2148

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
(cherry picked from commit 39d2e59e7e)
2017-03-09 10:38:54 +02:00
Paweł Dziepak
a096a173bf Merge "Avoid loosing changes to keyspace parameters of system_auth and tracing keyspaces" form Tomek
"If a node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.

To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.

Fixes #2129."

* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
  db: Create default auth and tracing keyspaces using lowest timestamp
  migration_manager: Append actual keyspace mutations with schema notifications

(cherry picked from commit 6db6d25f66)
2017-03-08 20:37:10 +01:00
Nadav Har'El
13d52a8172 sstable decompression: fix skip() to end of file
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.

Fixes #2143

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
(cherry picked from commit 506e074ba4)
2017-03-08 12:35:47 +02:00
Gleb Natapov
14b8a49b3f memtable: do not open code logalloc::reclaim_lock use
logalloc::reclaim_lock prevents reclaim from running which may cause
regular allocation to fail although there is enough of free memory.
To solve that there is an allocation_section which acquire reclaim_lock
and if allocation fails it run reclaimer outside of a lock and retries
the allocation. The patch make use of allocation_section instead of
direct use of reclaim_lock in memtable code.

Fixes #2138.

Message-Id: <20170306160050.GC5902@scylladb.com>
(cherry picked from commit d7bdf16a16)
2017-03-07 11:16:47 +02:00
Gleb Natapov
e437edafe6 memtable: do not yield while holding reclaim_lock
Holding reclaim_lock while yielding may cause memory allocations to
fail.

Fixes #2139

Message-Id: <20170306153151.GA5902@scylladb.com>
(cherry picked from commit 5c4158daac)
2017-03-06 18:36:20 +02:00
Takuya ASADA
b91456da90 dist/debian/dep: fix broken link of gcc-5, update it to 5.4.1-5
Since gcc-5/stretch=5.4.1-2 removed from apt repository, we nolonger able to
build gcc-5.

To avoid dead link, use launchpad.net archives instead of using apt-get source.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ba323e2074)
2017-03-06 16:32:22 +02:00
Tomasz Grabiec
ebb404275d db: Fix overflow of gc_clock time point
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.

I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.

Fixes the following UBSAN report:

  /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'

Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 4b6e77e97e)
2017-03-01 18:50:42 +02:00
Tomasz Grabiec
92e6c62c1c query: Fix invalid initialization of _memory_tracker by moving-from-self
Fixes the following UBSAN warning:

  core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment

Since the field was not initialied properly, probably also fixes some
user-visible bug.
Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0c84f00b16)
2017-03-01 11:57:21 +00:00
Gleb Natapov
3edcee4821 sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying file's object causes
crashes.

[tgrabiec: fixed typo]

Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
2017-02-28 11:10:17 +02:00
Avi Kivity
9c148d6e4c Update seastar submodule
* seastar ee84851...8e54a9b (1):
  > fix append_challenged_posix_file_impl::process_queue() to handle recursion

Fixes #2121.
2017-02-28 10:57:22 +02:00
Shlomi Livne
2ea4da28e8 release: prepare for 1.6.1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-02-16 22:37:07 +02:00
Takuya ASADA
a04722aeb2 dist/redhat: stop backporting ninja-build from Fedora, install it from EPEL instead
ninja-build-1.6.0-2.fc23.src.rpm on fedora web site deleted for some
reason, but there is ninja-build-1.7.2-2 on EPEL, so we don't need to
backport from Fedora anymore.

Fixes #2087

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 9c8515eeed)
2017-02-15 13:01:19 +02:00
Avi Kivity
41b8dc3b84 Update seastar submodule
* seastar 0bfd7fe...ee84851 (1):
  > prometheus: send one MetricFamily per unique metric name

Fixes #2077.
Fixes #2078.
2017-02-13 16:12:26 +02:00
Avi Kivity
b6ebe2e20b Merge "Avoid avalanche of tasks after memtable flush" from Tomasz
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.

However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.

Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()

(cherry picked from commit 7a00dd6985)
2017-02-02 22:19:25 +01:00
Pekka Enberg
7e1b245887 release: prepare for 1.6.0 2017-02-01 13:58:06 +02:00
Takuya ASADA
83fc7de65f dist: add lspci on dependencies, since it used by dpdk-devbind.py
On minimum setup environment scylla_sysconfig_setup will fail because lspci command is not installed. So install it on package installation time.

Fixes #2035

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485327435-20543-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bce0fb3fa2)
2017-01-25 17:37:18 +02:00
Pekka Enberg
d4781f2de3 release: prepare for 1.6.rc2 2017-01-24 14:33:12 +02:00
Amos Kong
2193a83a82 dist/redhat: fix path of housekeeping.cfg
scylla-housekeeping[3857]: Config file /etc/scylla.d/housekeeping.cfg is missing, terminating

Housekeeping failed to execute for missing the config file,
the config file should be in /etc/scylla.d/.

Fixes #2020

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <e63f2f8cb94410a6dca4e6193932f0079755ad47.1484724328.git.amos@scylladb.com>
(cherry picked from commit b880bdccef)
2017-01-19 11:09:41 +02:00
Pekka Enberg
78d74bf23a dist/docker: Use Scylla 1.6 RPM repository 2017-01-18 11:58:42 +02:00
Pekka Enberg
642a479c73 release: prepare for 1.6.rc1 2017-01-16 19:02:05 +02:00
Tomasz Grabiec
8a6d0ad2fa storage_proxy: Fix capturing of on-stack variable by reference
partition_range_count was accepted by do_with callback by value and
then captured by reference by async code, thus invoking use after
destroy.

Message-Id: <1484317846-14485-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3c3a4358ae)
2017-01-16 11:49:34 +02:00
Tomasz Grabiec
37f73781ee storage_proxy: Add missing initialization of _short_read_allowed
Dropped by a1cafed370 ("storage_proxy:
handle range scans of sparsely populated tables").

Fixes the failure in update_cluster_layout_tests.TestUpdateClusterLayout test.

Message-Id: <1484317450-13525-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 66547e7d7c)
2017-01-13 16:49:30 +02:00
Takuya ASADA
6fd5442fb7 scylla-housekeeping: move uuid file to /var/lib/scylla-housekeeping
Since scylla-housekeeping running as scylla user, it doesn't have a permission
to create a file on /etc/scylla.d.
So introduce /var/lib/scylla-housekeeping which owns by scylla user, place uuid
file on the directory.

Fixes #2009

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484235946-12463-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bee7f549a9)
2017-01-13 16:28:36 +02:00
Avi Kivity
e2777e508c Update seastar submodule
* seastar 2909f6c...0bfd7fe (1):
  > io_queue: remove owner number from metric name
2017-01-12 16:46:25 +02:00
Tomasz Grabiec
8cf7bbf208 storage_proxy: Fix use-after-free on one_or_two_partition_ranges
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until
operation resolves. However, we were passing expiring value to it, the
result of unwrap().

Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test

Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.

Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1e8151b4f2)
2017-01-12 15:12:06 +02:00
Vlad Zolotarov
4b5742d3a6 tracing::trace_keyspace_helper: use generate_legacy_id() for CF IDs generation
Explicitly generate tables' IDs of tables from the system_traces KS  using
generate_legacy_id() in order to ensure all Nodes create these tables with
the same IDs.

This is going to prevent hitting issue #420.

Fixes #1976

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1484153725-31030-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit ca0a0f1458)
2017-01-12 11:36:52 +02:00
Avi Kivity
cf27d44412 config: disable new sharding algorithm
It still has problems:
 - while resharding a very large leveled compaction strategy table, a huge
   amount of tiny sstables are generated, overwhelming the file descriptor
   limits
 - there is a large impact on read latency while resharding is going on
2017-01-12 10:15:51 +02:00
Avi Kivity
7b40e19561 Update seastar submodule
* seastar 0b49f28...2909f6c (2):
  > file/dup: don't decrease refcnt twice when file is explicitly closed
  > file: add dup() support

Preparing for file descriptor reduction during resharding backport.
2017-01-10 16:59:26 +02:00
Duarte Nunes
210e66b2b8 query_pagers: Fix over-counting of rows
This patch fixes a regression introduced in 0518895, where we counted
one extra row per partition when it contained live, non static rows.

We also simplify the visitor logic further, since now we don't need to
count rows one by one. Also remove a bunch of unused fields.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482234083-2447-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit d7e607ff51)
2017-01-10 16:54:09 +02:00
Amnon Heiman
6af51c1b1d scylla_setup: remove the uuid file creation
Scylla housekeeping can crete a uuid file if it is missing. There is no
longer need to create one for it.

Fixes #2004

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-3-git-send-email-amnon@scylladb.com>
(cherry picked from commit 8cd3d7445c)
2017-01-09 17:00:49 +02:00
Amnon Heiman
05f2bf5bd5 scylla-housekeeping: Create a uuid file if one is missing
This patch gets housekeeping to create a uuid file if a path to a uuid
file is upplied but the file is missing.

Because it import the uuid lib, uuid parameters where renamed.

Fixes #1987

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-2-git-send-email-amnon@scylladb.com>
(cherry picked from commit 32888fc0aa)
2017-01-09 15:27:04 +02:00
Avi Kivity
8002326f80 storage_proxy: prevent short read due to buffer size limit from being swallowed during range scan
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.

Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.

Fixes #2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>

(cherry picked from commit 8f36dca6f1)
2017-01-09 09:27:56 +00:00
Tomasz Grabiec
a00a1a1044 db: Make system tables use the commitlog
Before this patch system table writes were not writing to commit log
because database::add_column_family() disables writes to commit log
for the table which is added if _commitlog is not set at that
time. Fix by initializing commit log before system tables are created.

Fixes #1986.

Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.

Could cause system keyspace writes to be delayed for more than before
under heavy write workload. Refs #1926.

Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit cd630fece6)
2017-01-05 14:54:11 +02:00
Avi Kivity
337d6fb2cf storage_proxy: fix result ordering for parallel partition range scans
During a range scan, we try to avoid sorting according to partition range
when we can do so.  This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.

However, we use the wrong key for the sort -- we use the shard number.  But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.

Fix by storing the iteration order as the sort key.  We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.

Fixes #1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>

(cherry picked from commit eb520e7352)
2017-01-05 12:56:33 +01:00
Avi Kivity
1a8211e573 result_memory_tracker: fix too-short short reads
1.6 truncates paged queries early to avoid overrunning server memory
with too-large query results, but in the case of partition range queries,
this terminates too early due to an uninitialized variable holding the
maximum result size.  This results in slow performance due to additional
round trips.

Fix by initializing the maximum result size from the result_memory_tracker
running on the coordinating shard.

Fixes #1995.
Message-Id: <20170105103915.10633-1-avi@scylladb.com>

(cherry picked from commit 4667641f5f)
2017-01-05 11:04:03 +00:00
Takuya ASADA
a1a6c10964 dist/redhat: add python-setuptools on dependency since it requires for scylla-housekeeping
scylla-housekeeping breaks when python-setuptools doesn't installed, so
add it on dependency.

Fixes #1884

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483525828-7507-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 43655512e1)
2017-01-04 14:32:52 +02:00
Avi Kivity
c692824786 Update seastar submodule
* seastar 1e45fc8...0b49f28 (2):
  > metrics: Metrics function should take variable as a refernce
  > collectd: create metrics with the right format
2017-01-04 12:41:33 +02:00
Gleb Natapov
0ccdbbf1af storage_proxy: do not deref unengaged stdx:optional
Fixes intentional short reads.

Message-Id: <20161227142133.GE1829@scylladb.com>
(cherry picked from commit 4ca58959ad)
2017-01-01 12:16:45 +02:00
Takuya ASADA
138ad64cbc dist/ubuntu: check lsb_release existance since it's not included minimal Debian installation
Ubuntu has it in minimal installation but Debian doesn't, so add it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483003565-2753-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit e48cc9cf01)
2016-12-29 16:40:20 +02:00
Pekka Enberg
01737a51a9 tracing: Add seastar/core/scollectd.hh include
Fix the following build breakage:

FAILED: build/release/gen/cql3/CqlParser.o
g++ -MMD -MT build/release/gen/cql3/CqlParser.o -MF build/release/gen/cql3/CqlParser.o.d -std=gnu++1y -g  -Wall -Werror -fvisibility=hidden -pthread -I/home/penberg/scylla/seastar -I/home/penberg/scylla/seastar/fmt -I/home/penberg/scylla/seastar/build/release/gen  -march=nehalem -Ifmt -DBOOST_TEST_DYN_LINK -Wno-overloaded-virtual -DFMT_HEADER_ONLY -DHAVE_HWLOC -DHAVE_NUMA -DHAVE_LZ4_COMPRESS_DEFAULT  -O2 -DBOOST_TEST_DYN_LINK  -Wno-maybe-uninitialized -DHAVE_LIBSYSTEMD=1 -I. -I build/release/gen -I seastar -I seastar/build/release/gen -c -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
In file included from ./query-request.hh:31:0,
                 from ./locator/token_metadata.hh:51,
                 from ./locator/abstract_replication_strategy.hh:29,
                 from ./database.hh:26,
                 from ./service/storage_proxy.hh:44,
                 from ./db/schema_tables.hh:43,
                 from ./db/system_keyspace.hh:46,
                 from ./cql3/functions/function_name.hh:45,
                 from ./cql3/selection/selectable.hh:48,
                 from ./cql3/selection/writetime_or_ttl.hh:45,
                 from build/release/gen/cql3/CqlParser.hpp:63,
                 from build/release/gen/cql3/CqlParser.cpp:44:
./tracing/tracing.hh:357:5: error: ‘scollectd’ does not name a type
     scollectd::registrations _registrations;
     ^~~~~~~~~

Message-Id: <1482939751-8756-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit a443dfa95e)
2016-12-28 19:16:30 +02:00
Pekka Enberg
9753a39284 Update seastar submodule
* seastar 0b98024...1e45fc8 (1):
  > Merge "migrate network related seastar collectd metrics to the new metrics registration API" from Vlad
2016-12-28 17:06:06 +02:00
Avi Kivity
42a76567b7 dht: use nonwrapping_ranges in ring_position_range_sharder
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges.  Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>

(cherry picked from commit 8686a59ea5)
2016-12-27 19:16:26 +02:00
Avi Kivity
725949e8bf Merge "Fixes for intentional short reads" from Paweł
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.

I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.

It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."

* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
  query: introduce result_memory_accounter::foreign_state
  storage_proxy: fix short reads in parallel range queries
  storage_proxy: pass maximum result size to replicas
  mutation_partition: use result limiter for digest reads
  query: make result_memory_limiter constants available for linker
  result_memory_limiter: add accounter for digest reads
  idl: allow writers to use any output stream
  result_memory_limiter: split new_read() to new_{data, mutation}_read()
  idl: is_short_read() was added in 1.6
  mutation_partition: honour allowed_short_read for static rows
  storage_proxy: fix _is_short_read computation
  storage_proxy: disallow short reads if got no live rows
  storage_proxy: don't stop after result with no live rows

(cherry picked from commit 868b4d110c)
2016-12-27 19:16:15 +02:00
Amnon Heiman
231cf22c0e Set the prometheus prefix to scylla
This patch make the prometheus prefix configurable and set the default
value to scylla.

Fixes #1964

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>
(cherry picked from commit 70b2a1bfd4)
2016-12-27 17:57:08 +02:00
Takuya ASADA
ef0ffd1cbb dist/common/scripts/scylla_setup: improve the message of disk selection prompt
Not to confuse users, describe we only list up unmounted disks.

Fixes #1841

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479720708-6021-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 7c3b98806d)
2016-12-27 17:56:11 +02:00
Gleb Natapov
2b67c65eb6 messaging_service: move MUTATION_DONE messages to separate connection
If a node gets more MUTATION request that it can handle via RPC it will
stop reading from this RPC connection, but this will prevent it from
getting MUTATION_DONE responses for requests it coordinates because
currently MUTATION and MUTATION_DONE messages shares same connection.

To solve this problem this patches moves MUTATION_DONE messages to
separate connection.

Fixes: #1843

Message-Id: <20161201155942.GC11581@scylladb.com>
(cherry picked from commit 0a2dd39c75)
2016-12-27 17:55:01 +02:00
Piotr Jastrzebski
5b0971f82f mutation_partition: don't use unique_ptr to manage LSA objects
Unique_ptr won't destruct them correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>
(cherry picked from commit 3e502de153)
2016-12-27 17:54:45 +02:00
Raphael S. Carvalho
da843239bf sstables: fix calculation of memory footprint for summary
size of keys weren't taken into account, so value reported
via collectd is much smaller than actual footprint.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>
(cherry picked from commit e28537b56f)
2016-12-27 17:54:35 +02:00
Avi Kivity
5c96b04f4d Revert "config, dht: reduce default msb ignore bits to 4"
This reverts commit b81a57e8eb.

With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.

(cherry picked from commit 3989e4ed15)
2016-12-27 17:54:09 +02:00
Avi Kivity
a1d463900f storage_proxy: handle range scans of sparsely populated tables
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard.  With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.

Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).

If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatentation.  This
optimizes for the dense(r) case, where few partial results are required.

If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys.  This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.

We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.

[tgrabiec: trivial conflicts]

Message-Id: <20161220170532.25173-1-avi@scylladb.com>
(cherry picked from commit a1cafed370)
2016-12-27 16:57:18 +02:00
Avi Kivity
66598a68d5 tests: adjust mutation_query_test for partition and row limits
Won't build otherwise.

(cherry picked from commit b740aff777)
2016-12-27 16:53:35 +02:00
Avi Kivity
8ad0e96025 Merge "storage_proxy: Enforce row limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. Uppers layers like the pager may
already be depending on this behavior."

* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
  query_pagers: Don't trim returned rows
  select_statement: Don't always trim result set
  query_result_merger: Limit rows
  mutation_query: to_data_query_result enforces row limit

(cherry picked from commit 3421ebe8be)
2016-12-27 16:36:41 +02:00
Avi Kivity
1a2a63787a Merge "storage_proxy: Enforce partition limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."

* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
  storage_proxy: Decrease limits when retrying command
  storage_proxy: Don't fetch superfluous partitions
  query::result: Add partition count
  column_family: Use counters in query::result::builder
  query_result_builder: Use the underlying counters
  mutation_partition: Count partitions in query_compacted
  mutation_partition: Remove tabs in query_compacted
  query::result::builder: Add partition count
  query_result_merger: Limit partitions

(cherry picked from commit 6bb875bdb7)
2016-12-27 16:36:27 +02:00
Avi Kivity
52e5706147 Point seastar submodule at scylla-seastar.git
Allow separate management of Scylla 1.6's version of sestar.
2016-12-27 16:33:22 +02:00
Pekka Enberg
cec07ea366 release: prepare for 1.6.rc0 2016-12-27 12:41:34 +02:00
Benoît Canet
cbe729415e scylla_setup: Use blkid or ls to list potentials block devices
blkid does not list root raw device.

Revert to lsblk while taking care of having a fallback
path in case the -p option is not supported.

Fixes #1963.

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20161225100204.13297-1-benoit@scylladb.com>
(cherry picked from commit a24ff47c63)
2016-12-27 12:34:56 +02:00
Raphael S. Carvalho
b5190f9971 db: avoid excessive disk usage during sstable resharding
Shared sstables will now be resharded in the same order to guarantee
that all shards owning a sstable will agree on its deletion nearly
the same time, therefore, reducing disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.

Fixes #1952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
(cherry picked from commit 27fb8ec512)
2016-12-27 12:19:25 +02:00
Tomasz Grabiec
7739456ec2 sstables: Fix double close on index and data files when writing fails
file output streams take the responsibility of closing the file, they
will close the file as part of closing the stream.

During sstable writing we create sstable object and keep file
references there as well. Sstable object also has responsibility for
closing the files, and does so from sstable::~sstable().

Double close was supposed to be avoided by a construct like this:

  writer.close().get();
  _file = {};

However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.

Another problem is that if exception happened before we reached that
construct, we still should close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed and once that's fixed double close on _file will be even more
likely.

The fix employed here is to not keep files inside sstable object when
writing. As soon as the writer is constructed, it's the only owner of
the file.

Fixes #1764.

Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f2a63270d1)
2016-12-27 11:11:51 +02:00
Takuya ASADA
e4da0167d4 dist/redhat: don't try to adduser when user is already exists
Currently we get "failed adding user 'scylla'" on .rpm installation when user is already exists, we can skip it to prevent error.

Fixes #1958

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1482550075-27939-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit f3e45bc9ef)
2016-12-27 09:47:52 +02:00
Vlad Zolotarov
a3266060e3 tracing: don't start tracing until a Tracing service is fully initialized
RPC messaging service is initialized before the Tracing service, so
we should prevent creation of tracing spans before the service is
fully initialized.

We will use an already existing "_down" state and extend it in a way
that !_down equals "started", where "started" is TRUE when the local
service is fully initialized.

We will also split the Tracing service initialization into two parts:
   1) Initialize the sharded object.
   2) Start the tracing service:
      - Create the I/O backend service.
      - Enable tracing.

Fixes issue #1939

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 62cad0f5f5)
2016-12-21 12:49:41 +02:00
Glauber Costa
f6c83f73ef track streaming and system virtual dirty memory
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.

After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
(cherry picked from commit 7133583797)
2016-12-21 12:44:30 +02:00
Avi Kivity
293876c72f Merge "Limit number of readers streaming uses" from Paweł
"Original, naive db::make_streaming_reader() implementation created a set
of memtable and sstable readers for every partition range. This caused
bad interaction with the code limiting sstable readers concurrency and
was suboptimal.

This series introduces multi range mutation reader that takes mutation
source and a sorted, disjoint vector of ranges. It creates only a single
set of memtable and sstable readers and fast forwards it to the next
range once the current one is completed."

* 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev:
  db: use multi range reader for streaming readers
  dht: describe split_range[s]_to_shards() guarantees
  repair: remove outdated fixme
  test/mutation_reader_test: add multi_range_reader test
  tests/mutation_reader: extract key creation code
  mutation_reader: add multi_range_reader
2016-12-15 17:48:31 +02:00
Paweł Dziepak
cf679a413c db: use multi range reader for streaming readers
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.

The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
2016-12-15 13:54:43 +00:00
Paweł Dziepak
b86a826baf dht: describe split_range[s]_to_shards() guarantees
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
2016-12-15 13:07:32 +00:00
Paweł Dziepak
5287417136 repair: remove outdated fixme 2016-12-15 13:07:32 +00:00
Paweł Dziepak
5b0cf20f75 test/mutation_reader_test: add multi_range_reader test 2016-12-15 13:07:32 +00:00
Paweł Dziepak
787a976c2b tests/mutation_reader: extract key creation code 2016-12-15 13:07:32 +00:00
Paweł Dziepak
52a4e79210 mutation_reader: add multi_range_reader
So far, the only way to combine outputs of multiple readers was to use
combining reader. It is very general and, in particular, supports case
when the readers emit mutations from overlapping ranges.

However, we have cases (e.g. streaming) when we need to read from
several disjoint ranges. Combining reader is a suboptimal solution as it
requires to creating a reader for each range and ignores the fact that
they do not overlap.

This patch introduces multi_range_mutation_reader which takes a
mutation_source and a sorted set of disjoint ranges. Internally, it uses
mutation_reader::fast_forward_to() to move to the next range once the
current one is completed.
2016-12-15 13:07:31 +00:00
Pekka Enberg
06c5216c9d Merge "Improve gossip feature logging" from Asias 2016-12-15 10:36:54 +02:00
Asias He
e578e65103 gossip: Log feature enabled message on shard zero only
Feature is per node. No need to log them number of shards times.
2016-12-15 16:33:11 +08:00
Asias He
4137fab91b gossip: Make log in check_features debug level
We saw the message twice for the same feature check. This is a bit
confusing.

INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}

This is because

   ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE);
   ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE);

The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE)
is constructed. The second message is printed when the
ss._range_tombstones_feature is copy-constructed.
2016-12-15 16:33:10 +08:00
Asias He
2b1ebc4719 gossip: Introduce gms:features::enable helper
Add the helper function to enable the a feature and log the feature is
enabled.

When a feature is enabled, we see

INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled
INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled

in the log.
2016-12-15 16:33:10 +08:00
Paweł Dziepak
b70e5d2089 Merge seastar upstream
Submodule seastar 6fbd792..0b98024:
  > fstream: fix read ahead byte metric types
  > fstream: add read-ahead metrics
  > future-util: make stop_iteration use bool_class<>
  > util: introduce bool_class<Tag>
2016-12-14 15:01:13 +00:00
Avi Kivity
57f4910832 Merge "Query result size limiting" from Paweł
"This series makes Scylla limit size of query results it produces in case they
grow unreasonably large. This is possible because CQL paging queries do not
guarantee that the returned page is going to have page_size rows and pages
smaller than tha *do not* indicate end of stream. Non-paged queries and Thrift
requests do not have such flexibility and they also get all the requested data
(though their memory usage is still accounted for and may limit paged queries).

There is a maximum result size (1 MB) and all results builders will stop after
reaching it. Moreover, there is a per-shard limitation on the amount of memory
used by all results combined (10%). To avoid tiny results a query has to
reserve (wait if necessary) 4 kB before starting executing, after that it can
consume more memory without any additional waiting provided it is below
individual and shard-local limits.

Enabling the cluster to return less rows than requested also means some changes
for the coordinator. Firstly, if it receives such short result from a replica
retrying it with a larger limit obviously makes no sense whatsoever. Instead,
in such cases the coordinator removes the clustering rows it has incomplate
information about and sends short result back to the client. Moreover, even
if no replica returned short response reconciliation may have made it so. In
this case, the coordinator do not necessairly need to retry the query as well.
Unfortunately, with the current implementation short responses ruin data
queries since they will cause a digest mismatch.

Three new metrics were added:
 * database_bytes_total_result_memory -- total memory used by query results
 * database_total_operations_short_data_queries -- data queries that were
   limited by size, particulary bad as it basically forces coordinator to
   retry them as mutation queries
 * database_total_operations_short_mutation_queries -- mutation queries limited
   by size"

* 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: clean up after primary_key introduction
  cql3: allow short reads with paged queries
  storage_proxy: handle intentional short reads
  storage_proxy: make sure coordinator has complete data
  storage_proxy: honour partition limit
  storage_proxy: use cmd limits to determine that replica reached end
  db: add metrics for short reads and memory used for results
  data_query: limit result size
  mutation_query: limit result size
  db: create result_memory_accounters when starting query
  query_builder: add partition_slice getter
  reconcilable_result: keep result_memory_tracker object
  mutation_compactor: honour stop_iteration from consumers
  db: add result_memory_limiter
  query: add result size limiter
  reconcilable_result: properly propagate short_read flag
  query_pagers: handle short reads properly
  query: allow short reads
  serializer_impl: add serializer for bool_class<Tag>
2016-12-14 16:53:07 +02:00
Paweł Dziepak
4c69d7e2fe storage_proxy: clean up after primary_key introduction
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify patch introducing its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dde4bd5051 cql3: allow short reads with paged queries
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
3c173d87b5 storage_proxy: handle intentional short reads
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dd67de7218 storage_proxy: make sure coordinator has complete data
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).

However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
2ff5308d8e storage_proxy: honour partition limit
At the moment the coordinator does not care much for the partition
limit. In particular it doesn't check whether after reconciliation the
result still contains enough partitions.

This patch makes it honour the partition limit and increase it in the
retried queries if necessary.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
7bed7aa7de storage_proxy: use cmd limits to determine that replica reached end
Coordinator may retry a query with larger limits. However, code
determining whether replica has no more data always used the original
limits. This may cause a livelock.

For example, consider cluster having the following partitions (deletions
cover live cells):

node1:
pk=0, v=0
pk=1, v=1

node2
delete pk=0
delete pk=1
pk=2, v=2
pk=3, v=3

Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is
going to send partitions 0 and 1 while second node is going to send 2
and 3 + tombstones for 0 and 1. The coordinator will decide that it
needs to retry the request with larger row limit since node1 may have
some information about partitions 2 and 3 that are newer than what node2
has sent.

However, when the second response arrives node1 will still sent only two
rows since it has no more data. Because the coordinator uses original
row limit it will not notice that this node reached the end and we are
going to get another retry without making any progress.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
cfd4d0f680 db: add metrics for short reads and memory used for results
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
ba51e7e8db data_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
f1b9f49f2b mutation_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
6c33a4f177 db: create result_memory_accounters when starting query
This pach ensures than when we start executing a query a minimum result
size is reserved from result_memory_limiter.

Moreover, range queries need a way of merging memory usage information
from different shards.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
0bce4047bd query_builder: add partition_slice getter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
15de8de9e5 reconcilable_result: keep result_memory_tracker object
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
34f9eb4cbd mutation_compactor: honour stop_iteration from consumers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
5d7185fd39 db: add result_memory_limiter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
ee89d80d5c query: add result size limiter
This patch introduces an infrastrucutre for limiting result size.

There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an invidual limit which restricts a result to 4 MB.
In order

In order to avoid sending tiny results there is minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
43fe3439ca reconcilable_result: properly propagate short_read flag
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
837d24f1b2 query_pagers: handle short reads properly
Currently, the paging implementation assumes that the server retunrs
either as many rows as it was asked for all reached the end. Soon,
that's not going to be true so instead of making any assumptions about
the number of the rows returned use the new "short read" flag to
determine whether there is going to be more data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
da7ca85040 query: allow short reads
When paging is used the cluster is allowed to return less rows than the
client asked for. However, if such possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to sent and short reads caused by any other means.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Paweł Dziepak
7a15c89b1d serializer_impl: add serializer for bool_class<Tag>
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Takuya ASADA
8918a4be57 dist/common/scripts/scylla_setup: don't abort scylla_setup when each setup script failed
Instead of abort scylla_setup, print warning message then continue to next setup.

Fixes #1357

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481713664-18429-1-git-send-email-syuu@scylladb.com>
2016-12-14 13:31:50 +02:00
Tomasz Grabiec
c9344826e9 tests: Remove unintentional enablement of trace-level logging
Sneaked in by mistake.
2016-12-14 10:58:07 +01:00
Tomasz Grabiec
fe6a70dba1 tests: commitlog: Fix assumption about write visibility
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and new segment may be
active while the previous one is still being written or not yet
synced.

Fix the test so that it expectes that the number of mutations read
this way is <= the number of mutations read, and that after all
segments are synced, the number of mutations read is equal.

Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
2016-12-14 11:29:33 +02:00
Avi Kivity
a61ff53150 Merge "rework flush criteria" from Glauber
"The current criteria for memtable flush is not being respected.  The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.

More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.

To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.

Fixes #1918"

* 'flush-criteria-v6' of github.com:glommer/scylla:
  config: get rid of memtable_total_space
  database: rework dirty memory hierarchy
  system keyspace: write batchlog mutation in user memory
  database: remove flush_token
  database: abstract pressure condition notification
  database: encapsulate semaphore_units into a flush_permit
  database: remove friendship declaration
  database: simplify flush_one
  database: make memtable_list aware in cases it can't flush
2016-12-14 11:24:10 +02:00
Takuya ASADA
c18a95cddf dist/redhat: add scylla_lib.sh to scylla.spec
Fix .rpm build error.

Fixes #1932

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481703992-9596-1-git-send-email-syuu@scylladb.com>
2016-12-14 10:27:37 +02:00
Glauber Costa
56df53f51e compaction_manager: fix shutdown sequence
By the time we are able to acquire this semaphore, we may be stopped
already. So we need to test it before we go ahead. I can see shutdown
hangs before this patch that are fixed with it applied.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <e5b378893128d086d584ffbb2acd3fb687648e5c.1481655433.git.glauber@scylladb.com>
2016-12-14 09:26:24 +01:00
Glauber Costa
2aa6514667 config: get rid of memtable_total_space
Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 17:05:12 -05:00
Glauber Costa
80440c0d79 database: rework dirty memory hierarchy
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.

That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.

This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors

To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.

Fixes #1918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 14:07:53 -05:00
Glauber Costa
db7cc3cba8 system keyspace: write batchlog mutation in user memory
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:35 -05:00
Glauber Costa
be9e4c71ad database: remove flush_token
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
98030ad66c database: abstract pressure condition notification
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
c9a8b03311 database: encapsulate semaphore_units into a flush_permit
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.

Preparation patch for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
2e8c7d2c62 database: remove friendship declaration
Not needed anymore since memtable started having a direct pointer to the
memtable list.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
bb1509c21e database: simplify flush_one
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.

It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
8ab7c04caa database: make memtable_list aware in cases it can't flush
Some of our CFs can't be flushed. Those are the ones who are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.

We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.

It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Takuya ASADA
0a6312d254 dist/common/scripts/scylla_ntp_setup: fix incorrect usage of is_debian_variant
Use it as "if is_debian_variant; then".
Fixes #1931

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481644262-29383-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:42 +02:00
Takuya ASADA
ed4cd1908f dist/common/scripts/scylla_selinux_setup: correct CentOS/RHEL detection
CentOS/RHEL is using SELinux, and it's NOT Debian variant, so fixed from
"is_debian_variant" to "! is_debian_variant".

Fixes #1930

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643873-28984-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:29 +02:00
Takuya ASADA
6c0dc55495 dist/common/scripts/scylla_selinux_setup: to use is_debian_variant(), need to source /usr/lib/scylla/scylla_lib.sh
This fixes following command not found error:
```
/usr/sbin/scylla_selinux_setup: line 7: is_debian_variant: command not found
```

Fixes #1929

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643308-28637-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:13 +02:00
Takuya ASADA
3b74c50546 dist/ubuntu: add uuidgen to package dependency
We haven't added uuidgen to Ubuntu/Debian package dependency, so scylla_setup
script may abort because of command not found.

Fixes #1928

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481642385-27941-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:28:48 +02:00
Duarte Nunes
1e75a4950e database: Complete query when hitting partition limit
Currently, we weren't completing a query as early as possible if it
reached the partition limit, we instead had to wait until reaching the
end of the specified partition ranges. This patches fixes that by
including a check to the partition limit in the termination condition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>

Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
2016-12-13 14:53:46 +02:00
Tomasz Grabiec
f451014785 schema: Implement operator<< for column_mapping
Message-Id: <1481310679-14074-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:20:46 +02:00
Tomasz Grabiec
059a1a4f22 db: Fix commitlog replay to not drop cell mutations with older schema
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.

During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.

Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.

Fixes #1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:19:32 +02:00
Avi Kivity
32d55bbb4c Merge seastar upstream
* seastar 0773e98...6fbd792 (2):
  > tls: Only run our "verify" function in client session
  > Merge "Clean the metric definition" from Amnon

Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
2016-12-13 12:17:14 +02:00
Avi Kivity
6f9c317b91 Merge "Use uuid file in housekeeping" from Amnon
"This patch adds the use of uuid file to the housekeeping daily version check.
uuid file are optional, if a file is missing no uuid will be used."
2016-12-13 10:52:44 +02:00
Avi Kivity
c67782f169 Merge seastar upstream
* seastar 0a74317...0773e98 (6):
  > tls: Add support for client cetrificate verification & priority strings
  > semaphore: add consume_units
  > semaphore: add available_units()
  > thread: check need_preempt for threads in a scheduling group as well
  > tutorial: fix semaphore example, and text
  > stop_iteration: add && and || operators
2016-12-12 18:06:19 +02:00
Avi Kivity
c801cc4bd1 Merge "streaming and repair updates" from Asias
"This series:
- We can make reader with ranges
- Fix possible use after free of 'si'
- Streaming ranges now are sorted and merged
- Fix shard_begin shard_end end loop in both streaming and repair"
2016-12-12 11:32:42 +02:00
Asias He
ba54654af3 streaming: Use interval_set to sort and merge ranges
So that the ranges are sorted and have no overlaps. We can have less
ranges to deal with and it can help the mutation readers to optimize.

Here is an exmaple of ranges generated by repair:

Before:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    before ranges = {(-3383928698815274642, -3376937163195039606],
    (-7260764223708720005, -7251657821052234309], (-4767213984179237293,
    -4747032371925842389], (-7645879646119667643, -7589962743703481776],
    (-2340199306656526861, -2320523117224780931], (-576028861239229331,
    -560973674020019962], (-4070378863644120252, -3987599893827407860],
    (-2551584407739673151, -2498779102482524711], (-5416061903556353312,
    -5354212455975869358], (37594980457713898, 67885601051654285],
    (3083778975065200884, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (5765437476661317163, 5778671293583720541],
    (5960610072466058818, 5972289771228014343], (7749618183851698485,
    7758080813117351135], (-3987599893827407860, -3899198931034439776],
    (-7251657821052234309, -7131649010279865221], (-3576581915808403133,
    -3383928698815274642], (-417850207760366422, -327959672080599465],
    (-2671876682129336880, -2551584407739673151], (-1305178847032904465,
    -1137497074548854552], (8540448858050275827, 8610171849752115483],
    (-560973674020019962, -417850207760366422], (-2498779102482524711,
    -2340199306656526861], (2394447940525988167, 2523396860109747637],
    (-6703329224557608009, -6517757811218772762], (-3675103288021821677,
    -3576581915808403133], (-5622185785296846551, -5416061903556353312],
    (8610171849752115483, 8742605005068551458], (8068079250973315241,
    8185655671734937642], (560264964510741191, 790641981923757238],
    (5581202487214475094, 5765437476661317163], (8742605005068551458,
    8923908282731801645], (-6038176423022601107, -5622185785296846551],
    (5778671293583720541, 5960610072466058818], (-3899198931034439776,
    -3675103288021821677], (8356739976149429222, 8540448858050275827],
    (-6517757811218772762, -6038176423022601107], (-8052600134279395253,
    -7645879646119667643], (-327959672080599465, 37594980457713898],
    (7758080813117351135, 8019254284118543066], (4781565016737645510,
    5067070718000527886], (2523396860109747637, 3083778975065200884],
    (-5354212455975869358, -4767213984179237293], (6784138025918878582,
    7190719703944308372], (67885601051654285, 447405341661896387],
    (-2190610927722759275, -1305178847032904465], (-4747032371925842389,
    -4070378863644120252]}, size=48

After:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    after  ranges = {(-8052600134279395253, -7589962743703481776],
    (-7260764223708720005, -7131649010279865221], (-6703329224557608009,
    -3376937163195039606], (-2671876682129336880, -2320523117224780931],
    (-2190610927722759275, -1137497074548854552], (-576028861239229331,
    447405341661896387], (560264964510741191, 790641981923757238],
    (2394447940525988167, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (4781565016737645510, 5067070718000527886],
    (5581202487214475094, 5972289771228014343], (6784138025918878582,
    7190719703944308372], (7749618183851698485, 8019254284118543066],
    (8068079250973315241, 8185655671734937642], (8356739976149429222,
    8923908282731801645]}, size=15
2016-12-12 11:09:26 +08:00
Asias He
e523803a5d token_metadata: Introduce interval_to_range helper
It is used to convert a boost::icl::interval<token> interval back to a
range<token>.
2016-12-12 11:09:26 +08:00
Asias He
af3d76e6ac repair: Fix a typo in the log
sucessfully -> successfully
2016-12-12 11:09:26 +08:00
Asias He
374324e6fb repair: Fix shard_begin and shard_end
A range now alternates between different shards: the first part of the
range goes to shard X, the next to shard X+1, but after a while we go
back to shard X. So we can't do a simple loop between shard_begin and
shard_end.

Fix by using the newly introduced dht::split_range_to_shards

Use the cf.make_streaming_reader with ranges to simplify the code a bit.
2016-12-12 11:09:26 +08:00
Asias He
1987264beb streaming: Make streaming reader with ranges
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.

1) Less readers are needed
before: number of ranges of readers
after: smp::count readers at most

2) No foreign_ptr is needed
There is no need to forward to a shard to make the foreign_ptr for
send_info in the first phase and forward to that shard to execute the
send_info in the second phase.

3) No do_with is needed in send_mutations since si now is a
lw_shared_ptr

4) Fix possible user after free of 'si' in do_send_mutations
We need to take a reference of 'si' when sending the mutation with
send_stream_mutation rpc call, otherwise:
   msg1 got exception
   si->mutations_done.broken()
   si is freed
   msg2 got exception
   si is used again
The issue is introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts) which is master only, branch 1.5 is not
affected.
2016-12-12 09:04:21 +08:00
Asias He
463cc4fbde dht: Introduce split_ranges_to_shards
Split a ranges into shard ranges map with ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
044c4ff44c dht: Introduce split_range_to_shards
Split a range into shard ranges map with ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
cd2105b8bd database: make_streaming_reader for ranges
Allow to make a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in following streaming patch.

We can make the reader more efficient later.
2016-12-12 09:04:21 +08:00
Duarte Nunes
ada2f1092e dht: Make i_partitioner::tri_compare pure virtual
This patch makes the i_partitioner::tri_compare() function pure
virtual as it is overridden by all partitioners.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211172037.16496-1-duarte@scylladb.com>
2016-12-11 19:29:37 +02:00
Duarte Nunes
bb66b051ed dht: Make i_partitioner::tri_compare memory safe
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, thus avoiding accessing random
memory and getting random results.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
2016-12-11 18:58:10 +02:00
Amnon Heiman
08dcd8cb4a scylla housekeeping ubuntu service: use uuid file
This patch adds uuid file support for ubuntu system. It also split the
behaviour between restart and daily checks. The first run in r mode and
the second in d mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:35:07 +02:00
Amnon Heiman
6fef24aaf0 housekeeping systemd service: use uuid file
This set the housekeeping systemd service to use a uuid file and use
daily mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:02:16 +02:00
Amnon Heiman
17b8306bc4 scylla-housekeeping support uuid file
Allows scylla-housekeeping getting the uuid from a file instead of the
command line.

If the file is missing no uuid will be used.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:00:34 +02:00
Avi Kivity
299d1fad0b Merge "reduce bloom filter overhead in compaction" from Raphael
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."

* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
  compaction: reduce bloom filter overhead with incremental selector
  tests: add test for sstable set's incremental selector
  sstable_set: introduce incremental selector
  compatible_ring_position: add function to return token
2016-12-11 09:46:58 +02:00
Glauber Costa
5803957ab5 compaction: fix build
Commit 732ee275 moved tracking of one statistics value inside a lambda
without capturing this in that lambda. Compilation fails as a result.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <68860640f4533dd43e43f341f1620e25464b700b.1481313455.git.glauber@scylladb.com>
2016-12-10 09:00:20 +02:00
Raphael S. Carvalho
fcfc84e836 compaction: reduce bloom filter overhead with incremental selector
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.

Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.

Fixes #1322.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
548f6066c5 tests: add test for sstable set's incremental selector
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
02541e15c1 sstable_set: introduce incremental selector
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to current interval. For other strategies, it just return
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:16 -02:00
Glauber Costa
9b5e6d6bd8 commitlog: correctly report requests blocked
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.

That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.

This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
2016-12-09 15:02:26 +02:00
Raphael S. Carvalho
732ee275f8 compaction: fix running compaction counter when splitting sstables
The counter was being increased before taking the semaphore, so
every pending split would count as a running compaction which
misleads the user as a result.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <f2050cc3599cee7af29d4579368a154708b37731.1481248048.git.raphaelsc@scylladb.com>
2016-12-09 15:01:43 +02:00
Raphael S. Carvalho
453620a316 compatible_ring_position: add function to return token
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-08 14:25:29 -02:00
Avi Kivity
872b5ef5f0 sstables: fix probe with Unknown component
Commit 53b7b7def3 ("sstables: handle unrecognized sstable component")
ignores unrecognized components, but misses one code path during probe_file().

Ignore unrecognized components there too.

Fixes #1922.
Message-Id: <20161208131027.28939-1-avi@scylladb.com>
2016-12-08 15:24:25 +01:00
Glauber Costa
733d87fcc6 database: try to acquire semaphore before we start flush
As Tomek pointed out, as we are starting the flush before we acquire the
semaphore, we are not really limiting parallelism, but only delaying the
end of the flush instead.

Fixes #1919

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>
2016-12-08 12:18:32 +01:00
Tomasz Grabiec
3511bf4a81 Merge branch 'tgrabiec/memtable-gentle-clearing' from seastar-dev.git
When row cache is disabled, update_cache() will do nothing to the
memtable. Active readers may keep the memtable alive for unbounded
amount of time, preventing it from going away. This doesn't play well
with virtual dirty accounting. Soon before calling update_cache(), the
memory which was subtracted during flush is added back to the amount
of virtual dirty memory. If there was write pressure all along, we
will be at the dirty memory limit. When we give back subtracted memory
this will put virtual dirty way above the limit. This will stall all
writes until another memtable flush drags virtual dirty down or
readers finally release the memtable. We want to prevent upward
jumps of virtual dirty.

First part of the fix is to ensure that as long as the memtable's
region is in the dirty group, we will not revert flushed memory. This
must happen synchronously from region's memory being removed from the
group in order to prevent upward virtual dirty jumps. To make this
easier, tracking of flushed memory was moved to the memtable object.

Another part of the fix is to gradually clear the memtable when cache
is disabled in a similar fashion as when it's moved to cache. This
ensures that the actual memory held by memtable's region is released
sooner than it dies.

Refs #1879
2016-12-08 12:18:32 +01:00
Gleb Natapov
a05516f14c storage_proxy: wire up range_slice_timeouts, range_slice_unavailables and read_unavailables counters
Message-Id: <20161206105154.GL1866@scylladb.com>
2016-12-08 11:42:52 +02:00
Avi Kivity
5530a61975 stables: fix build with older boost (boost::variant::get<T&>)
Older boost doesn't support boost::variant::get<T&> (where the type
parameter is reference qualified); remove (unneeded anyway).
2016-12-08 10:56:05 +02:00
Pekka Enberg
0bc3ce7e09 Merge "sstables: remove sharding metadata from Statistics component" from Avi
"Due to my misreading of Cassandra code, I thought it would ignore new
components in the Statistics component; however, it doesn't, and the change
(introduced in bdd11648ac ("sstables: add
intra-node sharding metadata") breaks sstable2json and likely any
Cassandra code that touches sstables.

To fix, move the sharding data into a new component ("Scylla.db"), which
Cassandra does ignore.  The new component is designed to be extensible so
we don't experience the same issue later on."
2016-12-08 10:14:07 +02:00
Avi Kivity
7f26f9c0f9 Merge "repair refactor and fix" from Asias
* tag 'asias/repair/subranges/refactor_fix/v1' of github.com:cloudius-systems/seastar-dev:
  repair: Limit the number of sub ranges
  repair: Use estimated_keys_for_range in repair_cf_range
  repair: Extract the target_partitions into repair_info class
  repair: Put request_transfer_ranges into repair_info class
  repair: Introduce check_failed_ranges helper
  repair: Introduce do_streaming helper
  repair: Make the neighbors const reference
  repair: Introduce repair_info
  repair: Attach the repair id in the stream plan name
2016-12-08 10:06:39 +02:00
Tomasz Grabiec
f7197dabf8 commitlog: Fix replay to not delete dirty segments
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.

The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.

Introduced in abe7358767.

dtest failure:

 commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup

Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
2016-12-07 15:54:47 +02:00
Avi Kivity
4fedbf8430 Merge "service::storage_proxy: rework collectd counters registration" from Vlad
- Add "coordinator" and "replica" categories
   - Use a new seastar/metrics_registration framework

* 'rearrange-storage-proxy-stats-v4' of github.com:cloudius-systems/seastar-dev:
  service::storage_proxy: rework the collectd counters registration
  service/storage_proxy: regroup collectd statistics
2016-12-07 15:38:40 +02:00
Avi Kivity
3c3a18f222 sstables: move sharding metadata from Statistics component to a new Scylla component
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
2016-12-07 15:20:13 +02:00
Avi Kivity
24140ec8c6 sstables: add support for sets of discriminated union types
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the disciminant as the key.  Parsers and writers
are auto-generated.
2016-12-07 13:27:52 +02:00
Avi Kivity
e0cce9d299 Merge "streaming: Improve logging" from Asias
"This seires adds streaming bandwidth and streaming plan name to the log when
streaming is finished."
2016-12-07 12:21:47 +02:00
Amos Kong
f32f7993cc systemd: reset housekeeping timer at each start
Currently housekeeping timer won't be reset when we restart scylla-server.
We expect the service to be run at each start, it will be consistent with
upstart script in Ubuntu 14.04

When we restart scylla-server, housekeepting timer will also be restarted,
so let's replace "OnBootSec" with "OnActiveSec".

Fixes: #1601

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <a22943cc11a3de23db266c52fd476c08014098c4.1480607401.git.amos@scylladb.com>
2016-12-06 18:33:37 +02:00
Takuya ASADA
5a5ab51254 dist/ubuntu/dep: fix incorrect file path to detect previously built .deb existance check
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-4-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
6dd6b868a6 scripts/scylla_install_pkg: support Debian
Supported Debian on installation script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-3-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
7f2df8f86e dist/common/scripts: introduce scylla_lib.sh
To reduce duplicated code and simplified scripts introduce scylla_lib.sh
for shellscripts which provides functions to classify distributions,
and load all sysconfig files.

This also fixes script bugs to misdetect Debian and RHEL.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-2-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
8464903021 dist/common/systemd/scylla-housekeeping.timer: workaround to avoid crash of systemd on RHEL 7.3
RHEL 7.3's systemd contains known bug on timer.c:
https://github.com/systemd/systemd/issues/2632

This is workaround to avoid hitting bug.

Fixes #1846

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480452194-11683-1-git-send-email-syuu@scylladb.com>
2016-12-06 10:48:28 +02:00
Takuya ASADA
b2c0059da3 dist/common/scripts/scylla_coredump_setup: use systemd-coredump on Ubuntu 16.04
Ubuntu 16.04 has systemd-coredump, better to use it.

Fixes #1916

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480679267-30844-1-git-send-email-syuu@scylladb.com>
2016-12-05 17:09:38 +02:00
Takuya ASADA
2976799ef2 main: fix startup failing on Ubuntu 15.10/16.04
Since Ubuntu 15.10/16.04 still uses Upstart to manage GUI session (not as init), when we directly launch Scylla on Ubuntu's GUI Terminal(not using systemctl or initctl), raise(SIGSTOP) mistakenly calls (Because GUI session has "UPSTART_JOB" environment variable, won't happen when running Scylla as systemd service).

To avoid this, we need to verify UPSTART_JOB == "scylla-server".
If it's part of GUI session UPSTART_JOB has to be "unity7", we need to avoid raise(SIGSTOP) in that case.

Fixes #1199

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480620421-28967-1-git-send-email-syuu@scylladb.com>
2016-12-05 16:28:25 +02:00
Tomasz Grabiec
527ff6aa40 db: Clear memtable after flush when cache is disabled
So that memory is released gradually (impacting latency less) and
sooner than when memtable is destroyed. Active readers may keep the
memtable alive for unbounded amount of time.

Refs #1879
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1bba51319e memtable: Maintain virtual dirty on clear()
When memtable is flushing, it subtracts _flushed_memory from groups's
size to gradually allow more writes. Ideally _flushed_memory would be
equal to region's size when flush ends, so the group's size would
reach zero. When the memtable and its region are gone the group size
should remain the same as after the flush. This is ensured by adding
back _flushed_memory to group's size right before the region is
removed from the group.

Calling clear() before region is removed from the group breaks the
accounting because it will shrink the region, but will not affect the
amount of memory subtracted due to _flushed_memory. So group's size
would decrease more than we want (twice the region's size). The fix is
to change clear() so that it reverts _flushed_memory by the amount by
which the region size is reduced. This will keep the groups's size
constant as long as _flushed_memory > 0.
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1b5f338c17 memtable: Track flushed memory in memtable object 2016-12-05 12:59:09 +01:00
Tomasz Grabiec
c3768fe4de memtable: Pass dirty_memory_manager& to memtable constructor
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:

  boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));

This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
2016-12-05 12:59:09 +01:00
Asias He
00d7a35949 utils: Put crc32 under utils namespace
It conflicts with crc in zlib
Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>
2016-12-05 11:48:29 +02:00
Takuya ASADA
54ea0055fc dist/common/scripts/node_exporter_install: use curl instead of wget
CentOS/Ubuntu contains curl on minimal instllation but wget doesn't,
and we already has dependency for curl, so we should switch to curl.

Fixes #1902

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480929047-2347-1-git-send-email-syuu@scylladb.com>
2016-12-05 11:26:36 +02:00
Asias He
86c2620b7a gossip: Skip stopping if it is not started
If exception is triggered early in boot when doing an I/O operation,
scylla will fail because io checker calls storage service to stop
transport services, and not all of them were initialized yet.

Scylla was failing as follow:
scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local()
[with Service = gms::gossiper]: Assertion `local_is_initialized()' failed.
Aborting on shard 0.
Backtrace:
  0x000000000048a2ca
  0x000000000048a3d3
  0x00007fc279e739ff
  0x00007fc279ad6a27
  0x00007fc279ad8629
  0x00007fc279acf226
  0x00007fc279acf2d1
  0x0000000000c145f8
  0x000000000110d1bc
  0x000000000041bacd
  0x00000000005520f1
  0x00007fc279aeaf1f
Aborted (core dumped)

Refs #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>
2016-12-05 09:42:37 +02:00
Asias He
49229964d0 streaming: Add streaming plan name when session is failed
Before:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed, peers={127.0.0.1, 127.0.0.2}

After:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed for streaming plan repair-in-29, peers={127.0.0.1, 127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
1c47e26913 streaming: Add streaming plan name when all sessions are completed
Before:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed, peers={127.0.0.2}

After:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed for streaming plan repair-in-32, peers={127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
984f427cb5 streaming: Log streaming bandwidth
It looks like:

[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.1 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.2 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.3 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] bytes_sent = 393284364, bytes_received = 0, tx_bandwidth = 17.048 MiB/s, rx_bandwidth = 0.000 MiB/s
[Stream #f3907fd0-a557-11e6-a583-000000000000] All sessions completed, peers={127.0.0.1, 127.0.0.2, 127.0.0.3}

Fixes #1826
2016-12-05 08:20:18 +08:00
Asias He
4ae5781e40 repair: Limit the number of sub ranges
A range is diveded into N sub ranges so that each sub range contains 100
partitions. So N depends on the number of partitions in that range.  N
can grow unbounded and the memory usage of vector to hold these sub
ranges can go unbouded.

Limit the max number of sub ranges a range can divided into.

The downside is that the limited sub range will make we include more
partitions in the checksum.

Fixes #1917
2016-12-05 08:12:48 +08:00
Asias He
d850b86145 repair: Use estimated_keys_for_range in repair_cf_range
Use the newly introduced interface to estimate number of partitions in
the range.
2016-12-05 08:05:07 +08:00
Asias He
7b63cbbe0d repair: Extract the target_partitions into repair_info class
We can tune the number on a per repair basis.
2016-12-05 08:05:07 +08:00
Asias He
d9b689321e repair: Put request_transfer_ranges into repair_info class 2016-12-05 08:05:07 +08:00
Asias He
7741393059 repair: Introduce check_failed_ranges helper
To check if there is any failed ranges and log it.
2016-12-05 08:05:07 +08:00
Asias He
f8d7aa597b repair: Introduce do_streaming helper
To execute the stream_plans to sync data between nodes.
2016-12-05 08:05:07 +08:00
Asias He
d0a6290d4f repair: Make the neighbors const reference
We do not modify it. Make it const reference.
2016-12-05 08:05:07 +08:00
Asias He
6d0f6c1a99 repair: Introduce repair_info
To reduce the number of parameters we pass around. Simplify the code a
little bit.
2016-12-05 08:05:06 +08:00
Asias He
9be5170c07 repair: Attach the repair id in the stream plan name
So that we know which repair id this stream plan belongs to.
2016-12-05 08:05:06 +08:00
Tomasz Grabiec
d496dfeced Update seastar submodule
* seastar 7790e68...0a74317 (2):
  > core/reactor: Move definitions out of #ifndef
  > Add systemtap-sdt-devel to fedora dependencies

Fixes #1915.
2016-12-02 10:49:17 +01:00
Vlad Zolotarov
e5e7ac1bd4 service::storage_proxy: rework the collectd counters registration
Use the new seastar's metrics_registration framework:
   - Change the registration syntax.
   - Add a long description for each counter.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:38:09 -05:00
Vlad Zolotarov
3bf12e4ffc service/storage_proxy: regroup collectd statistics
Instead of putting all statistics under the same "storage_proxy" category
separate them into 2 groups according to where the corresponding counters
are updated:
   - "storage_proxy_replica"
   - "storage_proxy_coordinator"

Fixes #1763

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:27:47 -05:00
Glauber Costa
99a5a77234 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will ilustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>
2016-12-01 13:20:46 +01:00
Tomasz Grabiec
570fc0008b scylla-gdb: Fix lookup of symbols in 'scylla ptr'
Message-Id: <1480529617-26564-1-git-send-email-tgrabiec@scylladb.com>
2016-12-01 12:33:29 +02:00
Raphael S. Carvalho
b30a2cb21a lcs: generate info that preserves token distribution in higher levels
The information (last compacted keys) is lost after node is restarted
or schema is updated, which causes strategy to be rebuilt.
We need it for strategy to guarantee uniform distribution of token
range across sstables, or we could end up with 1 sstable of level L
overlapping with lots of sstables of level L+1, and that results in
a compaction of undesired length.
That information can be generated from scratch by getting last key
of newest sstable in each level > 0.

Fixes #1906.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:58 +02:00
Raphael S. Carvalho
38743c1948 sstables: provide write time of data component
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <59686148149f2159990329775e0cd8780bc54254.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:57 +02:00
Glauber Costa
d7256e7b21 database: do not call seal directly from the streaming timer
Streaming memtable have a delayed mode where many flushes are coalesced
together into one, with the actual flush happening later and propagated
to all the previous waiters.

However, the timer that triggers the actual flush was not using the
newly introduced flush infrastructure. This was a minor problem because
those flushes wouldn't try to take the semaphore, and so we could have
many flushes going on at the same time.

What was a potential performance issue became a correctness issue when
we moved the reversal of the dirty memory accounting out of
revert_potentially_cleaned_up_memory() into remove_from_flush_manager().

Since the latter is only called through the flush infrastructure, it
simply wasn't called. So the deferral of the reversal exposed this bug.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>
2016-11-30 18:00:55 +01:00
Tomasz Grabiec
c35e18ba12 tests: Fix use-after-free on commitlog
Only shutdown() ensures all internal processes are complete. Call it before calling clear().

Message-Id: <1480495534-2253-1-git-send-email-tgrabiec@scylladb.com>
2016-11-30 11:03:26 +02:00
Avi Kivity
281b4c64ea Update ami submodule
* dist/ami/files/scylla-ami 25e101f...d5a4397 (1):
  > scylla_install_ami: allow specify different repository for Scylla installation and receive update
2016-11-29 19:26:49 +02:00
Takuya ASADA
17ef5e638e dist/ami: allow specify different repository for Scylla installation and receive update
This fix splits build_ami.sh --repo to three different options:
 --repo-for-install is for Scylla package installation, only valid
 during AMI construction.

 --repo-for-update will be stored at /etc/yum.repos.d/scylla.repo, to
 receive update package on AMI.

 --repo is both, for installation and update.

Fixes #1872

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480438858-6007-1-git-send-email-syuu@scylladb.com>
2016-11-29 19:26:07 +02:00
Avi Kivity
5ea235e3e8 Merge "Prevent overloading memory with timed out writes" from Tomasz
"The goal of this series is to prevent unbounded memory use
in cases when requests are timing out. Write requests which timed
out may still occupy memory for a while because of local mutation
application. This memory is not accounted for and can build up.

First part of the fix changes local mutation application so that it times out
at about the same time as the request handler. Then the life
time of the request handler is extended to cover any background activity
of that request which hasn't timed out yet. This has two main effects:

  (1) by timing out local writes we prevent build up of background activity
      for timed out requests

  (2) we ensure that memory used by background activity is not left
      behind unaccounted for. This will prevent CQL server from admitting
      more requests than memory usage limit allows.

Fixes #1756."

* tag 'tgrabiec/prevent-oom-on-timeouts-v5' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: Do not flood logs with timeout errors
  database: Add counter for timed out writes
  storage_proxy: Delay timeout response until background work ceases
  storage_proxy: Propagate timeout to local writes
  storage_proxy: Use shared ownership for abstract_write_response_handler
  storage_proxy: Add counter for all alive write handlers
  db: Allow writes to be timed out
  db: Introduce counters for failed reads and writes
  commitlog: Allow allocations to be timed out
  utils/logalloc: Add ability to timeout run_when_memory_available() task
  utils/flush_queue: Add ability to wait with a timeout
2016-11-29 18:55:52 +02:00
Avi Kivity
28a5ff51cb dist: add build dependency on systemtap-sdt
Needed to newer seastar.
2016-11-29 18:49:51 +02:00
Tomasz Grabiec
48bbd6733c storage_proxy: Do not flood logs with timeout errors
Timeout errors are flooding the log after local mutate can time
out. We don't log remote mutate timeouts, so for consistency we won't
log local ones as well.

There is a database counter for timed out writes which can be
consulted in order to check if they're occuring.

Perhaps this would be better solved by a generic log message
throttling/coalescing mechanism, but that's not ready yet.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
b5d5612f98 database: Add counter for timed out writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
14cb31f69a storage_proxy: Delay timeout response until background work ceases
Write requests which timed out may still occupy memory for a while due
to local write. It should time out soon as well but there is a time
window in which it has not yet. If we don't delay timeout response,
the request would be seen as not consuming any memory too early. This
in turn would cause the CQL server to allow more requests than we
want. In some cases causing OOM or exceeding memory limits and causing
excessive cache eviciton.

Fixes #1756.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
ba3779802f storage_proxy: Propagate timeout to local writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
6d195a1538 storage_proxy: Use shared ownership for abstract_write_response_handler 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
5805330d98 storage_proxy: Add counter for all alive write handlers
Currently the counter uses _response_handlers.size(), but after later
patches we may have an active (timed out) write with no response
handler, so count live instances instead.
2016-11-29 16:40:58 +01:00
Tomasz Grabiec
2c561ecaed db: Allow writes to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
b1ae6ad2ad db: Introduce counters for failed reads and writes 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
31645e2c4a commitlog: Allow allocations to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
e14caaef60 utils/logalloc: Add ability to timeout run_when_memory_available() task 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
61d81617e1 utils/flush_queue: Add ability to wait with a timeout 2016-11-29 16:40:58 +01:00
Raphael S. Carvalho
a16425833c size_tiered: do not recreate bucket when it goes beyond max threshold
Problem will cause size tiered to return small jobs when there are
more than max_threshold sstables of similar size. For example, if
max_threshold is 32, and there are 36 sstables of similar size,
strategy will only return 4 sstables to be compacted. That's because
we incorrectly create a new bucket when it meets the max threshold.
What we should do is to allow buckets to grow beyond max threshold
and trim them when selecting the most suitable one for compaction.

Important to mention that estimation for size tiered will now
work better when there are more than max_threshold sstables of
similar size.

Fixes #1901.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>
2016-11-29 16:56:02 +02:00
Glauber Costa
353a4cd2d4 commitlog: sync segments before acquiring semaphore on shutdown.
Sync all segments before acquiring the semaphore, otherwise waiting may
have to wait for the timer to kick in and push them down.
Note that we can't guarantee that no other requests were executed in the
mean time, so we have to sync again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>
2016-11-29 11:07:28 +02:00
Tomasz Grabiec
96c7764458 Revert "prevent commitlog replay position reordering during reserve refill"
This reverts commit 0e9b75d406.

commitlog_test fails with this:

Running 14 test cases...
ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed
ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required)

*** 1 failure detected in test suite "tests/commitlog_test.cc"
WARN  2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
2016-11-28 20:52:13 +01:00
Raphael S. Carvalho
f141b0cdae database: atomically add new sstables to cf when refreshing
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.

Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.

Fixes #1368.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
2016-11-28 20:30:48 +02:00
Glauber Costa
0e9b75d406 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will ilustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>
2016-11-28 19:26:26 +01:00
Takuya ASADA
1042e40188 dist/common/scripts/scylla_kernel_check: fix incorrect document URL
Fixes #1871

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480327243-18177-1-git-send-email-syuu@scylladb.com>
2016-11-28 13:51:19 +02:00
Avi Kivity
18df2d9e9e partition_version: fix const correctness in rows_entry_compare
Using a non-const-correct comparator results in build failures with
boost 1.55.

Fixes #1892.
Message-Id: <20161128104335.28789-1-avi@scylladb.com>
2016-11-28 10:55:12 +00:00
Avi Kivity
5358984982 Merge seastar upstream
* seastar 93c3b12...7790e68 (7):
  > core/reactor: Introduce reactor-*/dervie-busy_ns metric
  > Collectd: Hold a reference to the metrics implementation in registration
  > future: Improve comments
  > fstream: actually use dynamically adjusted buffer
  > debug: add latency detector script
  > reactor: add static probes for latency detector
  > semaphore: Fix with_semaphore() in case wait() throws
2016-11-28 11:05:59 +02:00
Avi Kivity
28857e42e7 Merge " Virtualize size_estimates system table" from Duarte
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.

This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.

Fixes #1616"

* 'virtual-estimates/v4' of github.com:duarten/scylla:
  size_estimates_virtual_reader: Add unit test
  db: Delete size_estimates_recorder
  size_estimates: Add virtual reader
  column_family: Add support for virtual readers
  storage_service: get_local_tokens() returns a future
  nonwrapping_range: Add slice() function
  range: Find a sequence's lower and upper bounds
  system_keyspace: Build mutations for size estimates
  size_estimates: Store the token range as bytes
  range_estimates: Add schema
  murmur3_partitioner: Convert maximum_token to sstring
2016-11-28 10:12:59 +02:00
Avi Kivity
176fca5775 logalloc: use correct header for unique_ptr
<bits/unique_ptr.hh> is a libstdc++ internal header.  USe <memory> instead.
2016-11-27 23:08:04 +02:00
Glauber Costa
c32803f2f0 database: move reversion of virtual dirty state closer to update_cache.
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.

During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.

This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
2016-11-24 18:18:15 +01:00
Raphael S. Carvalho
4781b6eb71 sstables: use nonwrapping_range::make to avoid compilation issues
GCC 5.3.1 was unable to convert bound to optional<bound>.

sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));

In file included from ./dht/i_partitioner.hh:52:0,
                 from ./query-request.hh:28,
                 from ./clustering_key_filter.hh:27,
                 from sstables/sstables.hh:35,
                 from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
     explicit nonwrapping_range(const wrapping_range<T>& r)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
2016-11-24 11:26:16 +02:00
Duarte Nunes
cc3f26c993 lz4: Conditionally use LZ4_compress_default()
Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.

This is specially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the deprecated warning.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
2016-11-24 11:25:03 +02:00
Avi Kivity
1be95b1227 Merge seastar upstream
* seastar d6f26d8...93c3b12 (3):
  > rpc: Conditionally use LZ4_compress_default()
  > queue: allow queue to change its maximum size
  > util/defer: add missing return to move assignment
2016-11-24 11:00:53 +02:00
Duarte Nunes
a527ba285f thrift: Don't apply cell limit across rows
In Thrift, SliceRange defines a count that limits the number of cells
to return from that row (in CQL3 terms, it limits the number of rows
in that partition). While this limit is honored in the engine, the
Thrift layer also applies the same limit, which, while redundant in
most cases, is used to support the get_paged_slice verb.

Currently, the limit is not being reset per Thrift row (CQL3
partition), so in practice, instead of limiting the cells in a row,
we're limiting the rows we return as well. This patch fixes that by
ensuring the limit applies only within a row/partition.

Fixes #1882

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161123220001.15496-1-duarte@scylladb.com>
2016-11-24 10:38:31 +02:00
Takuya ASADA
ce80fb3a39 dist/ubuntu: increase number of open files on Ubuntu 14.04(upstart)
Follow the change of NOFILE for non-systemd environment.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479975050-14907-1-git-send-email-syuu@scylladb.com>
2016-11-24 10:13:41 +02:00
Avi Kivity
d58c8aaa32 db: remove unused belongs_to_{current,other}_shard(s) functions
Obsoleted by new sharding mechanism, but break the build for some.
2016-11-23 21:39:29 +02:00
Avi Kivity
b81a57e8eb config, dht: reduce default msb ignore bits to 4
With the default value of 12, a node's range is partitioned into
4096 * smp::count sub-ranges which are queried sequentually for a range
scan.  If the number of rows in the table is smaller than the required
result size, we will query all of them.  This can take so long that we
time out.

A better fix is to query multiple sub-ranges in parallel and merge them,
but for that we need to resurrect the non-sequential merger.
2016-11-23 21:25:37 +02:00
Pekka Enberg
c526a9f0be Update seastar submodule
* seastar 7473945...d6f26d8 (2):
  > semaphore_units: add missing return statement
  > metrics: Do not detroy the metrics layer if it is been used
2016-11-23 20:27:09 +02:00
Paweł Dziepak
919825a2c7 Merge "Improve sharding in large clusters" from Avi
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively.  Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.

This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes.  Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.

In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.

Fixes #1277."
2016-11-23 11:25:53 +00:00
Glauber Costa
18b9fa3d43 dist: increase number of open files
This limit was found to be too low for production environments. It would
be hit at boot, when we're touching a lot of files from multiple shards
before deciding that we don't need them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>
2016-11-23 13:10:25 +02:00
Avi Kivity
07d5a20bae Wire up sharding ignore msb parameter to configuration
We might have used a fancy map<sstring, any> to pass the parameters, but
that's overkill for now.
2016-11-22 22:40:47 +02:00
Avi Kivity
8b1d689de8 partitioner: add ignore_msb parameters to byte ordered and random partitioners
Ignored; doesn't make sense on byte ordered, and random is deprecated.
2016-11-22 21:56:42 +02:00
Avi Kivity
af16c0fac4 murmur3_partitioner: shard on the middle token bits, not most significant bits
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.

This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's hard, so we use token bits which still have
some entropy.  In effect, with changes the token range layout from

   shard 0
   shard 1
   ...
   shard S-1

to

   shard 0
   shard 1
   ...
   shard S-1

   shard 0
   shard 1
   ...
   shard S-1

   ...

   shard 0
   shard 1
   ...
   shard S-1

Where the number of repetitions of the block is 2^(ignored msb bits).

For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
2016-11-22 21:56:42 +02:00
Avi Kivity
024c8ef8a1 db: adjust sstable load to use sstable self-reporting of shard ownership
Instead of calculating the owning shard from the sstable's partition
key range, delegate to the new sstable method for getting owning shard
infomation.  This insulates us from changes in the sharding algorithm.
2016-11-22 21:56:40 +02:00
Avi Kivity
98a4544e1c sstables: add method to get sstable owning shards from an unloaded sstable
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do.  Add a method that allows us
to read just the sharding data, without opening anything else.
2016-11-22 21:52:23 +02:00
Avi Kivity
bdd11648ac sstables: add intra-node sharding metadata
Add a metadata component that describes token ranges that are spanned by
this sstable.  With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describing sharding information, but for multi-range algorithms, this
is not sufficient.
2016-11-22 21:44:25 +02:00
Avi Kivity
316ef1d70a sstables: automate writing statistics components
Add a virtual funnction to metadata_base so we can loop over statistics
components when writing them.
2016-11-22 21:05:06 +02:00
Glauber Costa
13973e7f3b keep background work semaphore alive during sstable flush
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we star the flush and not when we finish it.

That's not the intended behavior and that can cause flushes to
accumulate.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
2016-11-22 19:54:08 +01:00
Avi Kivity
d05b22e502 sstables: automatically calculate offsets in statistics
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
2016-11-22 20:35:24 +02:00
Avi Kivity
7c5e6525ef sstables: switch statistics components to generic serialized_size() implementation 2016-11-22 20:20:38 +02:00
Avi Kivity
096ae59a5b sstables: introduce generic serialized_size()
Introduce a new function that reuses the file_writer code to compute
the serialized size of an sstable object, by serializing it into memory
and discarding the result.
2016-11-22 20:06:23 +02:00
Avi Kivity
3c06ffac9d sstables: const correctness for the write(file_writer&, T&) functions
write() doesn't need to change its input; so change it to const.

The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.

This helps in writing a generic serialized_size() that is const correct,
in the next patch.
2016-11-22 20:04:27 +02:00
Tomasz Grabiec
eefc538225 Update seastar submodule
* seastar 7504026...7473945 (1):
  > Merge "Improve support for timeouts in primitives"
2016-11-22 17:51:29 +01:00
Glauber Costa
0b8b5abf16 commitlog: acquire semaphore earlier
Recently we have changed our shutdown strategy to wait for the
_request_controller semaphore to make sure no other allocations are
in-flight. That was done to fix an actual issue.

The problem is that this wasn't done early enough. We acquire the
semaphore after we have already marked ourselves as _shutdown and
released the timer.

That means that if there is an allocation in flight that needs to use a
new segment, it will never finish - and we'll therefore neve acquire
the semaphore.

Fix it by acquiring it first. At this point the allocations will all be
done and gone, and then we can shutdown everything else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>
2016-11-21 22:19:32 +00:00
Avi Kivity
6bdb8ba31d storage_proxy: don't query concurrently needlessly during range queries
storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be).  However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.

Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.

Fixes #1863.

Message-Id: <20161120165741.2488-1-avi@scylladb.com>
2016-11-21 18:19:46 +02:00
Glauber Costa
0ca8c3f162 database: keep a pointer to the memtable list in a memtable
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.

Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.

That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.

Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.

At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.

The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.

Fixes #1870

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
2016-11-21 18:18:27 +02:00
Duarte Nunes
def2bc72b0 size_estimates_virtual_reader: Add unit test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
6a37d87c76 db: Delete size_estimates_recorder
Now that access to the size_estimates system is virtualized, we no
longer need the recorder.

Fixes #1616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
225648780d size_estimates: Add virtual reader
This patch add a virtual mutation_reader so that queries
to the size_estimates system table are handled by the engine
without needing to perform any IO.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
cd7e2fd602 column_family: Add support for virtual readers
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
c0d450c57d storage_service: get_local_tokens() returns a future
This patch changes the get_local_tokens() function in storage_service
to return a future instead of requiring running under a seastar::thread.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
9b384d375f nonwrapping_range: Add slice() function
This patch add the slice() function to nonwrapping range, which uses
its bounds to slice an input sequence.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
bdba8d99c3 range: Find a sequence's lower and upper bounds
This patch extracts a pair of functions from mutation_partition to
calculate the lower and upper bounds of a sequence from a
nonwrapping_range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
636287fdf2 system_keyspace: Build mutations for size estimates
This patch adds a function to system_keyspace responsible for creating
a mutation to a partition of the size_estimates system table from a
set of range_estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
18ddec245e size_estimates: Store the token range as bytes
This patch changes the range_estimates struct so that the tokens are
represented as utf8 encoded bytes. This will make future patches
require less conversions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:14:21 +00:00
Duarte Nunes
e7a5162c1d range_estimates: Add schema
This will be used in future patches, when virtualizing the
size_estimates system table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Duarte Nunes
01815ecd24 murmur3_partitioner: Convert maximum_token to sstring
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
maximum_token for the maximum_token.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Takuya ASADA
eee63027e5 dist/ami/build_ami.sh: update base AMI to CentOS7-Base5
To drop unnecessary .ssh/authozied_keys, we need to update base AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479496938-29724-1-git-send-email-syuu@scylladb.com>
2016-11-21 10:12:47 +02:00
Avi Kivity
783729c540 Merge "Clean up T::memory_usage() function" from Paweł
"This series is just a cleanup which intention is to deal with all
confusion related to the way T::memory_usage() functions work.

* T::memory_usage() which returned external memory usage are renamed
  to T::external_memory_usage()
* T::memory_usage() is introduced where needed to avoid repeating
  sizeof(T) + T::external_memory_usage()"

Paweł Dziepak (6):
  rename memory_usage() to external_memory_usage() where applicable
  streamed_mutation: add memory_usage() to mutation fragment types
  keys: add memory_usage()
  partition_snapshot_accounter: use range_tombstone::memory_usage()
  mutation_rebuilder: use memory_usage()
  frozen_mutation: use memory_usage()
2016-11-21 10:11:39 +02:00
Avi Kivity
498887ca0d Merge seastar upstream
* seastar 31c5fd7...7504026 (2):
  > circular_buffer: add move assignment operator
  > scollectd: Fix serialization of GAUGE-typed values
2016-11-20 20:16:56 +02:00
Gleb Natapov
9222a47fed sstable test: add test for generated summary data
Message-Id: <20161117155051.GV6765@scylladb.com>
2016-11-20 19:50:45 +02:00
Glauber Costa
21c1e2b48c commitlog: wait for pending allocations to finish before closing gate.
allocations may enter the gate, so it would be wise for us to wait for them.

Fixes #1860

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com>
2016-11-20 19:45:33 +02:00
Avi Kivity
a39b92a40a build: fix tests-with-symbols generation
Bad indentation caused the libs variable for tests-with-symbols to be
overwritten, resulting in link failure.
2016-11-20 17:23:26 +02:00
Glauber Costa
504b5ac30f database: don't check for waiters in the condition variable predicate.
In the last iterations of this patchset, we have moved explicit flushes
to acquire the semaphore directly and the coalescing inside the
memtable_list. As a result, we are no longer keeping any kind of action
for them inside the condition variable. Checking for them has no longer
a purpose.

This is a cleanup patch that remove does checks.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>
2016-11-18 21:34:48 +01:00
Glauber Costa
1933349654 database: fix direct flushes of non-durable column families.
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.

We will remove the permits ourselves it there is an exception, but not
under normal circumnstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this test will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
2016-11-18 21:32:29 +01:00
Avi Kivity
6eecbc80dc CONTRIBUTING.md: add sections for help and issues
Don't scare away users reporting an issue with the CLA.
2016-11-18 22:21:10 +02:00
Glauber Costa
60b7d35f15 commitlog: close file after read, and not at stop
There are other code paths that may interrupt the read in the middle
and bypass stop. It's safer this way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>
2016-11-18 14:09:33 +00:00
Paweł Dziepak
249e0ab087 frozen_mutation: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
948c062e64 mutation_rebuilder: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
e04664e851 partition_snapshot_accounter: use range_tombstone::memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
711bd19f16 keys: add memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
6b8bf030c0 streamed_mutation: add memory_usage() to mutation fragment types
This patch introduces memory_usage() to static_row, clustering_row and
range_tombstone so that we can avoid repeating sizeof(T) +
x.external_memory_usage().

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
ef57b9a26f rename memory_usage() to external_memory_usage() where applicable
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Avi Kivity
fec4ef3390 Merge "Make sure commitlog replay is able to make progress" from Glauber
"Fixes #1856

Commitlog replay reads are being issued without a priority. That means
they will lose to compaction every time."

* 'issue-1856-v2' of github.com:glommer/scylla:
  commitlog: use read ahead for replay requests
  commitlog: use commitlog priority for replay
  commitlog: close replay file
2016-11-18 12:04:18 +02:00
Takuya ASADA
55e5123313 dist/redhat: Support RHEL7
We supported install CentOS7 .rpm on RHEL7, but we haven't supported
building on RHEL7, since there is little difference between CentOS,
and that causes build error.

This patch fixes the error, now we can produce .rpm for RHEL7 wihout
using CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479431134-8032-1-git-send-email-syuu@scylladb.com>
2016-11-18 11:56:05 +02:00
Glauber Costa
461778918b fix shutdown and exception conditions for flush logic
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
  the explicit flushes no longer wait on the condition variable. So we
  don't.
- We now wait on the stop() flushes (regardless of their return status)
  so we can make sure that the _flush_queue will indeed be done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
  to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
  handler

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
2016-11-17 21:16:44 +01:00
Glauber Costa
59a41cf7f1 commitlog: use read ahead for replay requests
Aside from putting the requests in the commitlog class, read ahead
will help us going through the file faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:54 -05:00
Glauber Costa
aa375cd33d commitlog: use commitlog priority for replay
Right now replay is being issued with the standard seastar priority.
The rationale for that at the time is that it is an early event that
doesn't really share the disk with anybody.

That is largely untrue now that we start compactions on boot.
Compactions may fight for bandwidth with the commitlog, and with such
low priority the commitlog is guaranteed to lose.

Fixes #1856

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:02 -05:00
Glauber Costa
4d3d774757 commitlog: close replay file
Replay file is opened, so it should be closed. We're not seeing any
problems arising from this, but they may happen. Enabling read ahead in
this stream makes them happen immediately. Fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 12:35:24 -05:00
Avi Kivity
eaf83ab59c Merge seastar upstream
* seastar 3001c08...31c5fd7 (2):
  > Safe use of collectd during shutdown
  > udp: abort reader and writer when udp channel close
2016-11-17 18:44:28 +02:00
Piotr Jastrzebski
9d33948487 mutation_rebuilder: fix fragment size calculation
It wasn't calculating the size of data correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c03dfff7bf1ca3199991e5864189f98bfa2942ea.1479397736.git.piotr@scylladb.com>
2016-11-17 16:23:42 +00:00
Raphael S. Carvalho
3dc9294023 db: do not leak deleted sstable when deletion triggers an exception
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.

The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.

We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.

Fixes #1840.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
2016-11-17 17:46:36 +02:00
Gleb Natapov
c052a1bc4f sstable: use schema's min_index_interval config when generating missing summary
Message-Id: <20161116181937.GA25303@scylladb.com>
2016-11-17 15:24:03 +02:00
Avi Kivity
5d067eebf2 Merge "get rid of memtable size parameter and rework flush logic" from Glauber
"This patchset allows Scylla to determine the size of a memtable instead
of relying in the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.

Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
last submission, except this time I have run some experiments to gauge
behavior of that versus 1/2 of dirty memory, which was a preferred
theoretical value.

After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allow us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.

I have run mainly two workloads with this. The first one a local RF=1
workload with large partitions, sized 128kB and 100 threads. The results
are:

Before:

op rate                   : 632 [WRITE:632]
partition rate            : 632 [WRITE:632]
row rate                  : 632 [WRITE:632]
latency mean              : 157.8 [WRITE:157.8]
latency median            : 115.5 [WRITE:115.5]
latency 95th percentile   : 486.7 [WRITE:486.7]
latency 99th percentile   : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max               : 722.6 [WRITE:722.6]
Total partitions          : 189667 [WRITE:189667]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

After:

op rate                   : 951 [WRITE:951]
partition rate            : 951 [WRITE:951]
row rate                  : 951 [WRITE:951]
latency mean              : 104.8 [WRITE:104.8]
latency median            : 102.5 [WRITE:102.5]
latency 95th percentile   : 155.8 [WRITE:155.8]
latency 99th percentile   : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max               : 1081.4 [WRITE:1081.4]
Total partitions          : 285324 [WRITE:285324]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

The other workload was the workload described in #1817. And the result
is that we now have a load that is very stable around 100k ops/s and
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s and lots of timeouts, or the deep reduction of
1.5-rc1."

* 'issue-1817-v4' of github.com:glommer/scylla:
  database: rework memtable flush logic
  get rid of max_memtable_size
  pass a region to dirty_memory_manager accounting API
  memtable: add a method to expose the region_group
  logalloc: allow region group reclaimer to specify a soft limit
  database: remove outdated comment
  database: uphold virtual dirty for system tables.
2016-11-17 14:36:43 +02:00
Avi Kivity
18078bea9b storage_proxy: avoid calculating digest when only one replica is contacted
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all.  The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).

This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.

In the future we may add other values for more hardware-friendly digest
algorithms.
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
2016-11-17 13:04:30 +02:00
Asias He
dc50ce0ce5 streaming: Make the mutation readers when streaming starts
Currenlty we make the mutation readers for streaming at different
time point, i.e.,

do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) {
     make a mutation reader for this range
     read mutations from the reader and send
})

If there are write workload in the background, we will stream extra
data, since the later the reader is made the more data we need to send.

Fix it by making all the readers before starting to stream.

Fixes #1815
Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>
2016-11-17 12:41:53 +02:00
Gleb Natapov
ae0a2935b4 sstables: fix ad-hoc summary creation
If sstable Summary is not present Scylla does not refuses to boot but
instead creates summary information on the fly. There is a bug in this
code though. Summary files is a map between keys and offsets into Index
file, but the code creates map between keys and Data file offsets
instead. Fix it by keeping offset of an index entry in index_entry
structure and use it during Summary file creation.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
2016-11-17 11:05:23 +02:00
Glauber Costa
f08162e181 database: rework memtable flush logic
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.

This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.

After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.

For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.

Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently, is that under heavy load the commitlog will keep
sealing memtables adding to the existing load.

Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges.  That is the perfect moment for us to act.
It indicates that all the data flushing part is done.

The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:58 -05:00
Glauber Costa
895e838ac0 get rid of max_memtable_size
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.

It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.

The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.

The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing down the memtable when we hit
the limit, and then throttle. When set the threshold to 1/4 of dirty, we
don't throttle at all.

This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
2ed3f342c1 pass a region to dirty_memory_manager accounting API
We would like to know from which region is a particular flush coming
from, and account accordingly. The reasoning behind that, is that soon
we'll be driving the flushes internally from the dirty_memory_manager
without explcitly triggering them.

We need to start a flush before the current one finishes, otherwise
we'll have a period without significant disk activity when the current
SSTable is being sealed, the caches are being updated, etc.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
0b337dab14 memtable: add a method to expose the region_group
That is technically not needed because a memtable inherits from group. So
whenever we have a memtable, we can use it's group() method to obtain a
group for it, and then from there go to the region_group.

However, region() is a const method in the memtable, so we have to play
trick with the const_cast, or remove the constness from the region. An
alternative to that, which I prefer, is to expose a method for the
region_group directly from the memtable object that does the right thing
and bypasses all that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
f86c9e36f4 logalloc: allow region group reclaimer to specify a soft limit
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.

However, For some applications, we would be interested in knowing about
memory build up earlier, so we can start doing something about it before
we reach that condition.

This patch introduce soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition is
abated.

There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard, threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
da738a6cd1 database: remove outdated comment
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
919de98aa5 database: uphold virtual dirty for system tables.
Currently the virtual dirty mechanism is not properly set for system
tables. We haven't divided the system table allowance by two, which
means it won't start thottling earlier as it was supposed to.

In practice, this has little effect because system table requests are
very well behaved, their sizes well known, and they tend to be
force-flushed. But we should be consistent.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Avi Kivity
f26c6569d2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 61ff5c6...25e101f (1):
  > scylla_install_ami: delete unneeded authorized_keys from AMI image
2016-11-16 22:36:31 +02:00
Takuya ASADA
3802f289f8 dist: remove bc from dependency
Since we replaced shellscript based cpuset generator with python based one,
we no longer depends to bc command.
d571123afd

So drop it from .rpm/.deb dependency.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479152876-11020-1-git-send-email-syuu@scylladb.com>
2016-11-16 15:02:55 +02:00
Amnon Heiman
a4be7afbb0 API: cache_capacity should use uint for summing
Using integer as a type for the map_reduce causes number over overflow.

Fixes #1801

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1479299425-782-1-git-send-email-amnon@scylladb.com>
2016-11-16 13:55:46 +01:00
Avi Kivity
31d0e31de2 Merge seastar upstream
* seastar 47e1821...3001c08 (5):
  > core: Introduce weak_ptr<>
  > timer: Add missing include
  > tutorial: fix TeX template
  > Merge "Adding the metrics layer" from Amnon
  > core/memory: let malloc(0) return a valid pointer
2016-11-16 14:20:49 +02:00
Pekka Enberg
8a4bd6ecd5 README: Guidelines for contributing
Message-Id: <1479288359-14168-1-git-send-email-penberg@scylladb.com>
2016-11-16 12:50:02 +02:00
Paweł Dziepak
f877be50b0 Merge "Keep wide partition cache entry longer than others" from Piotr
"Cache entries for wide partitions are usually smaller than other
entries and the cost of recreating them is higher so it makes sense
to keep them longer than ordinary entries."
2016-11-15 20:44:52 +00:00
Paweł Dziepak
b8d737ff0a tests/row_cache_test: verify that eviction follows lru
Refs #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:57:54 +01:00
Paweł Dziepak
999dafbe57 row_cache: touch entries read during range queries
Fixes #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:54:11 +01:00
Tomasz Grabiec
11c5f4ab50 storage_proxy: Add counters for throttled writes 2016-11-15 17:18:25 +01:00
Piotr Jastrzebski
5ec668c9c6 Add separate LRU for wide partitions.
Evict wide partitions only every 1000 normal partition
evictions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 16:19:13 +01:00
Piotr Jastrzebski
9a41bfbf69 Add collectd metric for wide partition evictions.
This will allow us to see how big is an amount
of evictions of cached info about wide partitions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 15:53:14 +01:00
Paweł Dziepak
055d78ee4c query_pagers: distinct queries do not have clustering keys
Query pager needs to handle results that contain partitions with
possibly multiple clustering rows quite differently than results with
just one row per partition (for example a page may end in a middle of
partition). However, the logic dealing with partitions with clustering
rows doesn't work correctly for SELECT DISTINCT queries, which are
much more similar to the ones for schemas without clustering key.

The solution is to set _has_clustering_keys to false in case of SELECT
DISTINCT queries regardless of the schema which will make pager
correctly expect each partition to return at most one rows.

Fixes #1822.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 11:06:01 +01:00
Glauber Costa
93386bcec7 histograms: do not use latency_in_nano
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nano just so we may need
to convert it back if the histogram needs another unit.

This patch will keep everything as a duration until last moment, and
then we'll convert when needed.

This was suggested by Amnon.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
2016-11-14 18:01:43 +02:00
Nadav Har'El
c5254b6502 repair: fix undefined variable
If the "trace" parameter of the repair was not given, we will use the
"trace" variable without setting it. We need to set a default value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1479136239-14204-1-git-send-email-nyh@scylladb.com>
2016-11-14 17:16:19 +02:00
Raphael S. Carvalho
e86de40b49 compaction_manager: inform about compaction cancelled by shutdown
After some changes in compaction manager, user no longer is informed
that compaction was cancelled in event of shutdown. That's because
we only ignore ready future when compaction manager was asked to
stop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <02ca29b5a93fe3a558896598f325b0dce069e82c.1478277317.git.raphaelsc@scylladb.com>
2016-11-14 16:37:33 +02:00
Piotr Jastrzebski
4fe989d58e Cleanup sstables::mutation_reader::impl
Pointer to sstable seems unnecessary.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a45e8853af2b5f896ec44144fbc26d3325a5ec0c.1479123740.git.piotr@scylladb.com>
2016-11-14 11:52:52 +00:00
Avi Kivity
14c1b17105 storage_service: fix construct_range_to_endpoint_map with semi-infinite range
After the conversion to nonwrapping ranges, construct_range_to_endpoint_map()
may be called with semi-infinite token ranges, but it does not expect this,
calling nonwrapping_range::end()->value() unconditionally.

Fix by checking whether this is a semi-infinite range on the right, and
replace ->value() by maximum_token() instead.

Fixes `nodetool describering` (once more).
Message-Id: <1478983010-29630-1-git-send-email-avi@scylladb.com>
2016-11-14 11:39:48 +01:00
Raphael S. Carvalho
9a9f0d3a0f main: fix exception handling when initializing data or commitlog dirs
Exception handling was broken because after io checker, storage_io_error
exception is wrapped around system error exceptions. Also the message
when handling exception wasn't precise enough for all cases. For example,
lack of permission to write to existing data directory.

Fixes #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>
2016-11-14 12:34:10 +02:00
Takuya ASADA
d571123afd dist/common/scripts/scylla_sysconfig_setup: stop using 'bc' command to generate cpuset parameter, use python script instead
We get error from bc command when we run the script on >34 ncpus,
to prevent the issue add a python script to generate cpuset parameter.

Fixes #1824

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478887624-12737-1-git-send-email-syuu@scylladb.com>
2016-11-14 11:45:23 +02:00
Duarte Nunes
66f6a367a4 ring_position_range_sharder: Avoid copying eagerly
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161104115632.15974-1-duarte@scylladb.com>
2016-11-13 11:42:23 +02:00
Avi Kivity
bf20aa722b Merge "Fixes for histogram and moving average calculations" from Glauber
"JMX metrics were found to be either not showing, or showing absurd
values.  Turns out there were multiple things wrong with them. The
patches were sent separately but conflict with one another. This series
is a collection of the patches needed to fix the issues we saw.

Fixes #1832, #1836, #1837"
2016-11-13 11:16:32 +02:00
Avi Kivity
2670e46f3e storage_service: deinline most methods
Most inline methods in storage_service are too large to be inlined, and
just increase compile time.  De-inline them.
2016-11-12 21:12:28 +02:00
Glauber Costa
608d825790 histogram: fix reporting units
We are tracking latencies in microseconds, but almost everywhere else
they are reported in microseconds. Instead of just converting, this
patch tries to be a bit more future proof and embed the unit into the
type - and we then default to microseconds.

I have verified that the JMX measures now report sane values for both
the storage proxy and the column family. nodetool cfhistograms still
works fine. That one is reported in nanoseconds, but through the
estimated_histogram, not ihistogram.

Fixes #1836

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-11 11:36:56 -05:00
Glauber Costa
1342d044eb moving averages: change metrics calculation
We have recently fixed a bug due to which the constructor parameters for
moving average were inverted, leading to the numbers being just plain
wrong. However, the calculation of alpha was already inverted, meaning
it was right by accident and now that's wrong.

With the wrong alpha, the values we see are still correct, but they move
very quickly. The intention of this code is obviously to smooth things
out.

This was found out by Nadav. I have tested and confirmed that the smoothing
factor now works as expected.

Fixes  #1837

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-10 22:33:34 -05:00
Amnon Heiman
a977ea85e1 histogram: moving_average and total rate should be calculate in seconds
The moving average and the total average should be calculated in seconds
and not nanoseconds.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-11-10 22:32:53 -05:00
Glauber Costa
d3f11fbabf histogram: moving averages: fix inverted parameters
moving_averages constructor is defined like this:

    moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)

But when it is time to initialize them, we do this:

	... {tick_interval(), std::chrono::minutes(1)} ...

As it can be seen, the interval and tick interval are inverted. This
leads to the metrics being assigned bogus values.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
2016-11-10 11:28:51 -08:00
Paweł Dziepak
f16d6f9c40 partition_version: make sure that snapshot is destroyed under LSA
Snapshot destructor may free some objects managed by the LSA. That's why
partition_snapshot_reader destructor explicitly destroys the snapshot it
uses. However, it was possible that exception thrown by _read_section
prevented that from happenning making snapshot destoryed implicitly
without current allocator set to LSA.

Refs #1831.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>
2016-11-10 13:13:10 +01:00
Gleb Natapov
27e041606b fix LOCAL_ONE printout
Message-Id: <20161109125307.GH7766@scylladb.com>
2016-11-09 12:53:55 +00:00
Duarte Nunes
e680587b8a sstable_test: Be explicit about uncompressed tables
After 7c28ed, the schemas defined in the test became compressed by
default. This patch changes the test so that it is explicit about
which schemas shouldn't define a compressor.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>
2016-11-09 11:21:59 +02:00
Pekka Enberg
b3dea313dd Merge "API changes for Cassandra 3.x migration" from Calle
"Mostly small changes/additions to the API calls to match Cv3
 requirements/semantics, i.e. updated scylla-jmx can implement required
 nodetool etc calls in a working fashion."
2016-11-09 10:30:32 +02:00
Duarte Nunes
e33c02aa60 cql3: Disable compression on empty properties
The CQL 3.1 documentation specifies that for disabling compression,
users should use an empty string:

ALTER TABLE mytable WITH COMPRESSION = {'sstable_compression': ''};

However, Cassandra also accepts the absence of the sstable_compression
option to disable compression. The patch 7c28ed prevented this behavior
in Scylla, which this patch aims to fix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478639499-4183-1-git-send-email-duarte@scylladb.com>
2016-11-09 10:03:59 +02:00
Gleb Natapov
93f068bd44 storage_proxy: fix speculation target selection logic
Current speculation target selection logic has several bugs in multi-dc
setup. It may select a non local target for CL=LOCAL and it may select
more than one target to speculate, one of which is non local.

Examples:

1. Two dataceneters: DC1 RF 2, DC2 RF 2  and read with LOCAL_QUORUM.

In this scenario db::filter_for_query() will return both replicas from
local DC and speculation target selection logic will peek one one which
will be in different DC.

2. Two dataceneters: DC1 RF 2, DC2 RF 2  and read with LOCAL_ONE + RRD.DC_LOCAL

In this scenario db::filter_for_query() will return all nodes in local DC and
there already be enough nodes to speculate, but current logic will add
one node from non local dc as a speculation target.

The patch below fixed both of those scenarios.

Message-Id: <20161103154637.GS7766@scylladb.com>
2016-11-08 18:32:47 +01:00
Paweł Dziepak
a8308e2a8d row_cache: dummy entry does not count as partition
Since continuity flag introduction row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
2016-11-08 13:54:44 +01:00
Piotr Jastrzebski
50b41f7d1d Fix row_cache_test
partition_range passed to row_cache::make_reader
has to be kept alive as long as the resulting reader
is used.

Otherwise weird things start to happen.

This used to work just because of a pure luck.
When I started changing the row_cache implementation
I run into very weird behaviors for this tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>
2016-11-08 13:28:53 +01:00
Calle Wilund
473326d49a api/column_family: Make mean row size return integral
As (at least) per C3, these metrics are integral in origin. Adapt.
(Other option would be to translate in jmx).
2016-11-08 12:22:04 +00:00
Calle Wilund
bd646a6755 repair (api): Add option handling (sort of) for nodetool default options 2016-11-08 12:22:04 +00:00
Calle Wilund
0181fc8159 api::cache_service: Add (dummy) calls for key&counter metrics 2016-11-08 12:22:04 +00:00
Calle Wilund
5eb54f9bc4 api::storage_service: c3 compat - make query keyspaces a trinary choice
all, user or non-local strategy ones.
2016-11-08 12:22:04 +00:00
Calle Wilund
3b7a7dd383 api::failure_detector: c3 compat - add endpoint phi value query 2016-11-08 12:22:04 +00:00
Calle Wilund
218df55349 failure_detector: add accessor and api shortcut for arrival samples 2016-11-08 12:22:04 +00:00
Calle Wilund
f9836cd23b api::endpoint_snitch: c3 compat - allow dc/rack query for broadcast 2016-11-08 12:22:04 +00:00
Calle Wilund
54ba06a8bf api::column_family: Add calls/parameters for c3 compatibility 2016-11-08 12:22:04 +00:00
Amnon Heiman
c8082ccadb API: fix a type in storage_proxy
This patch fixes a typo in the URL definition, causing the metric in the
jmx not to find it.

Fixes #1821

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1478563869-20504-1-git-send-email-amnon@scylladb.com>
2016-11-08 11:09:21 +02:00
Amos Kong
95fe88c1d3 scripts/scylla_current_repo: use HTTP to access downloads.scylladb.com
Https isn't available for downloads.scylladb.com, or we can access
it by https://s3.amazonaws.com/downloads.scylladb.com/...

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <d4b65e1724bbeb76c928790d5d3e95b91ee9db79.1478153034.git.amos@scylladb.com>
2016-11-08 11:03:50 +02:00
Avi Kivity
767cfb4fe9 storage_service: fix range wrapping in describe_ring even more
Commit 8fca1887c2 ("storage_service: fix range wrapping in
describe_ring") fixed incorrect range wrapping code for describe_ring,
but fails when the number of endpoints for a token is greater than one,
because the endpoints are stored in an unordered vector.

Fix by comparing the endpoints in a way that ignores their order.
Message-Id: <1478460826-15923-1-git-send-email-avi@scylladb.com>
2016-11-07 16:18:20 +01:00
Calle Wilund
11baf37ab5 commitlog: Prevent exceptions in stream::produce from being set twice
Fixes #1775
stream lacks a check "is_open", which is a bummer. We have to both
prevent exception propagation and add a flag of our own to make sure
exceptions in producer code reaches consumer, and does not simply
get lost in the reactor.
Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>
2016-11-07 11:41:33 +01:00
Tomasz Grabiec
e6cc0a2e10 Merge branch '1766/v1' from duarten/scylla.git
This patchset adds missing properties to the create_view_statement,
such as whether the view is compact or the order of its clustering
columns.

Fixes #1766
2016-11-07 10:44:24 +01:00
Takuya ASADA
0f1ba1a3bb dist/redhat: remove unused dependencies
Seems like we mistakenly added unneeded packages for BuildRequires when
we created .spec file, so remove them.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478504761-15067-1-git-send-email-syuu@scylladb.com>
2016-11-07 09:48:50 +02:00
Paweł Dziepak
985d2f6d4a Merge "Remove quadratic behavior from atomic sstable deletion" from Avi
"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion.  This
causes high cpu utilization during startup.

Change the code to avoid quadratic complexity, and add some unit tests.

See #1812."
2016-11-04 15:48:04 +00:00
Avi Kivity
f75aceabc5 sstables: add unit tests for atomic deletion
We simulate shards deleting sstables, but this is all happening on a single
core, and no sstables are harmed during test execution.
2016-11-04 15:48:43 +02:00
Avi Kivity
f10b9906d8 sstables: move atomic deletion code to its own files
This will simplify unit testing.  We move generic code that
depends only on seastar, so compile time should not increase too much.
2016-11-04 15:47:35 +02:00
Avi Kivity
9e85653c33 sstables: make atomic_deletion_manager more abstract
Make the shard count and method of deleting sstables abstract, in order
not to require all that machinery for unit tests.
2016-11-04 15:44:09 +02:00
Avi Kivity
e527da1e3c sstables: wrap atomic deletion code in a class
This makes it easier to abstract and unit-test.
2016-11-04 15:44:07 +02:00
Avi Kivity
a05837936a sstables: remove quadratic behavior from atomic sstable deletions
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy.  This guarantees
that any exception does not change the data, since the assignment does
not require allocation.

However, it does result in quadratic behavior.  During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
are do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.

Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.

Fixes #1812 (the quadratic beahvior part).
2016-11-04 15:42:44 +02:00
Avi Kivity
8fca1887c2 storage_service: fix range wrapping in describe_ring
describe_ring() tries to re-wrap the ranges, but fails because the ranges
are not sorted.  Adjust the code not to rely on sorting.
Message-Id: <1478198630-27483-1-git-send-email-avi@scylladb.com>
2016-11-04 10:48:14 +00:00
Paweł Dziepak
8afd9e52c7 Merge "Process range queries sequentially on shards" from Avi
"Currently, partition range queries are processed in parallel on all
shards. This is inefficient because we are likely to drop the results
from all but one shard, assuming a well-populated column family.  We
are multiplying our work by a factor of smp::count.

While this is worthwhile in its own right, it is really an excuse to
sneak in the range/shard generator (patch 5), which is preliminary for
a new sharding algorithm, dividing tokens among shards based on the
middle-significant bits rather than the most-siginificant bits (which
alias with vnodes)

Fixes #1573."
2016-11-04 09:58:04 +00:00
Tomasz Grabiec
c1a7e2090e Revert "database: change find_column_families signature so it returns a lw_shared_ptr"
This reverts commit f3528ede65.
2016-11-04 10:48:21 +01:00
Tomasz Grabiec
3b5ccda70e Revert "database: refactor code so apply_in_memory() is called only once"
This reverts commit 3f825f593d.
2016-11-04 10:48:18 +01:00
Tomasz Grabiec
6366eb5cf8 Revert "correctly calculate latencies for writes"
This reverts commit a382f10fc4.
2016-11-04 10:48:02 +01:00
Tomasz Grabiec
a5ee87611a Revert "database: when querying, move latency counter instead of copying"
This reverts commit 8840a5a593.
2016-11-04 10:47:58 +01:00
Tomasz Grabiec
f3c1ff78e6 Merge branch 'cql_read_write_counters-v4' from seastar-dev.git
New CQL counters from Vlad.
2016-11-04 09:19:07 +01:00
Avi Kivity
b3299d5bc3 storage_proxy: simplify range queries
Instead of asking a shard for cmd->partition_limit and cmd->row_limit,
just ask it for the number of partitions and rows still needed to
satisfy the query.  This removes the need to trim the shard's result.
2016-11-03 19:10:20 +02:00
Avi Kivity
a668e575f6 storage_proxy: execute multi-partition query sequentially over shards
Since every shard might cause the row_limit quota to be satisfied, every
shard might be the last one we need.  Hence it is better to process shards
sequentially, stopping if the quota is reached or the range is exhausted.

The original code tried to yield to reduce latency, but this is now
unnecessary, as we're doing a lot less work per iteration (if it becomes
necessary, we should do it on the replica shard, not the coordinating shard).
2016-11-03 19:10:20 +02:00
Avi Kivity
1d77e3a03a partitioner: add unit tests for token_for_next_shard()
i_partitioner::token_for_next_shard() is an inverse for
i_partitioner::shard_of(), test that this is so.
2016-11-03 19:10:20 +02:00
Avi Kivity
7202b94183 dht: introduce a sharder for vectors of partition ranges
Building on the single-range sharder, add a sharder for vectors of
partition ranges.  This helps with wrapped ranges, which are translated
into a vector containing two shards.
2016-11-03 19:10:20 +02:00
Avi Kivity
43a2380899 dht: add a generator for shard/range pairs
Divides a ring_position range into a sequence of shard/range pairs.  This
allows sequential iteration over shards in ring order.

The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page.  With the generator, we can switch to sequential
execution.
2016-11-03 19:10:17 +02:00
Avi Kivity
1f88d103a8 partitioner: add i_partitioner::token_for_next_shard()
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.

To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.

It is a sort-of inverse to shard_of(), in that

  shard_of(token_for_next_range(t)) == shard_of(t) + 1
2016-11-03 19:09:23 +02:00
Vlad Zolotarov
6c15dd967a cql3::query_processor: make the collectd metrics registration nicer
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
36cc351ae1 cql3::query_processor: add a counter for BATCH CQL statements
- Add a "batches" member to cql_stats.
   - Update it where appropriate.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
6e1d27bed1 cql3::query_processor: add a counter for a number of CQL modification requests ("writes")
- Add a inserts, updates, deletes members to cql_stats.
   - Store cql_stats& in a modification_statement and increment the corresponding counter according to the value of a "type" field.
   - Store cql_stats& in a batch_statement and increment the statistics for each BATCH member.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:15 -04:00
Vlad Zolotarov
fa4e1db0cb cql: add a counter for CQL read (SELECT) requests
- Add a "reads" counter to a cql3::cql_stats struct.
   - Store a reference for a query_processor::_cql_stats in the select_statement object.
   - Increment a "reads" counter where needed.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Vlad Zolotarov
7606588267 cql3::query_processor: add cql_stats
- Add cql_stats member.
   - Pass it to cql3::raw::parsed_statement::prepare() virtual method.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Glauber Costa
8840a5a593 database: when querying, move latency counter instead of copying
It is comprised of two time points. Let's move it instead of copying it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c7c155c77780e188bfbe05881c81ce86456016d5.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
a382f10fc4 correctly calculate latencies for writes
Right now we are calculating latencies only when we are about to add an
item to the memtable.

That's incorrect and misleading, for two reasons. First, it leaves the
commitlog latencies out. But second, it is done after the memtable wall
effect is applied, which means we are not counting throttle time neither
in the memtables or in the commitlog.

To do that, we'll start the latency_counter object as soon as possible
and move it all the way to apply_in_memory(). That should span the
entire write operation.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
3f825f593d database: refactor code so apply_in_memory() is called only once
There are two variants of apply_in_memory() being called in do_apply():
with and without the commitlog. The main differences are that when the
commitlog is involved, we need to wait for its future to complete before
moving to apply_in_memory. That can easily be factored out by providing
an always-ready future if we don't have the commitlog enabled, and
waiting on that.

The second, is that the commitlog version can cause apply_in_memory to
generate an exception if there is replay position reordering. However,
there is no harm in appending the exception handler to both versions. In
one of them it's an impossible exception, but that's fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
f3528ede65 database: change find_column_families signature so it returns a lw_shared_ptr
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.

If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Avi Kivity
6c45b0bae8 partitioner: make comparators public
The public comparison operators depend on global_partitioner(), and are
therefore less useful for tests.
2016-11-03 11:27:40 +02:00
Avi Kivity
6320181b97 partitioner: const correctness for comparators 2016-11-03 11:27:40 +02:00
Avi Kivity
470826d127 partitioner: change partitioners to have shard counts independent from smp::count
Useful for testing.
2016-11-03 11:27:40 +02:00
Avi Kivity
75706c0a26 size_estimates_recorder: sort token range before rewrapping it
Since size estimates are stored as wrapped ranges, we call compat::wrap()
to convert from the now-standard unwrapped ranges back to wrapped ranges.
However, compat::wrap() relies on the ranges being in sorted order,
but our input is not.  This leads to a crash as we find an unexpected
empty token in the middle of the vector.

Sort it so compat::wrap() works as expected.

Fixes #1804.
Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>
2016-11-03 09:43:41 +01:00
Avi Kivity
a35136533d Convert ring_position and token ranges to be nonwrapping
Wrapping ranges are a pain, so we are moving wrap handling to the edges.

Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.

Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
2016-11-02 21:04:11 +02:00
Takuya ASADA
8c55c99353 dist/common/scripts/scylla_io_setup: pass --smp option to iotune command
We were ignored --smp option taken from io.conf since iotune didn't supported
it, but now it supported we can pass it.
(We need to pass it because we need to measure io performance on same condition
with scylla)

Fixes #1768

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478082591-27205-1-git-send-email-syuu@scylladb.com>
2016-11-02 12:49:50 +02:00
Raphael S. Carvalho
53b7b7def3 sstables: handle unrecognized sstable component
As in C*, unrecognized sstable components should be ignored when
loading a sstable. At the moment, Scylla fails to do so and will
not boot as a result. In addition, unknown components should be
remembered when moving a sstable or changing its generation.

Fixes #1780.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b7af0c28e5b574fd577a7a1d28fb006ac197aa0a.1478025930.git.raphaelsc@scylladb.com>
2016-11-02 12:44:53 +02:00
Avi Kivity
72c2982260 dist: require scylla-boost-static for EL RPM build 2016-11-01 18:55:55 +02:00
Pekka Enberg
e1e8ca2788 cql3: Fix selecting same column multiple times
Under the hood, the selectable::add_and_get_index() function
deliberately filters out duplicate columns. This causes
simple_selector::get_output_row() to return a row with all duplicate
columns filtered out, which triggers and assertion because of row
mismatch with metadata (which contains the duplicate columns).

The fix is rather simple: just make selection::from_selectors() use
selection_with_processing if the number of selectors and column
definitions doesn't match -- like Apache Cassandra does.

Fixes #1367
Message-Id: <1477989740-6485-1-git-send-email-penberg@scylladb.com>
2016-11-01 09:09:01 +00:00
Pekka Enberg
d46ed53e9e scripts: add update-version
This patch adds an `update-version` script for updating the Scylla
version number in `SCYLLA-VERSION-GEN` file and committing the change to
git.

Example use:

  $ ./scripts/update-version 1.4.0

which results into the following git commit:

  commit 4599c16d9292d8d9299b40a3e44ef7ee80e3c3cf
  Author: Pekka Enberg <penberg@scylladb.com>
  Date:   Fri Oct 28 10:24:52 2016 +0300

      release: prepare for 1.4.0

  diff --git a/SCYLLA-VERSION-GEN b/SCYLLA-VERSION-GEN
  index 753c982..eba2da4 100755
  --- a/SCYLLA-VERSION-GEN
  +++ b/SCYLLA-VERSION-GEN
  @@ -1,6 +1,6 @@
   #!/bin/sh

  -VERSION=666.development
  +VERSION=1.4.0

   if test -f version
   then

Message-Id: <1477639560-10896-1-git-send-email-penberg@scylladb.com>
2016-10-30 12:43:41 +02:00
Avi Kivity
feb8faf70b Merge "make refresh resilient to permission denied error" from Raphael
Fixes #1709.

* 'refresh-resilient-v3' of github.com:raphaelsc/scylla:
  db: make refresh resilient to permission denied error
  db: make it possible to use custom error handler with io checker
  sstables: remove duplicated declaration of remove_by_toc_name
2016-10-30 10:28:09 +02:00
Takuya ASADA
68d9f5212c dist/ubuntu/dep/thrift.diff: add missing build time dependency
We need libcrypto header to build thrift, so add it.

Fixes #1798

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477676716-5726-1-git-send-email-syuu@scylladb.com>
2016-10-29 17:49:30 +03:00
Avi Kivity
71532d8cd5 Merge seastar upstream
* seastar 05f6c5c...47e1821 (1):
  > rpc: Avoid using zero-copy interface of output_stream (Fixes #1786)
2016-10-28 14:09:16 +03:00
Avi Kivity
e03ca06431 dist: fix rpm build
--static-boost is supposed to be an input to ./configure.py, not ninja.  Move
it there.
2016-10-28 08:42:26 +03:00
Pekka Enberg
b54870764f auth: Fix resource level handling
We use `data_resource` class in the CQL parser, which let's users refer
to a table resource without specifying a keyspace. This asserts out in
get_level() for no good reason as we already know the intented level
based on the constructor. Therefore, change `data_resource` to track the
level like upstream Cassandra does and use that.

Fixes #1790

Message-Id: <1477599169-2945-1-git-send-email-penberg@scylladb.com>
2016-10-27 23:37:26 +03:00
Glauber Costa
ef3c7ab38e auth: always convert string to upper case before comparing
We store all auth perm strings in upper case, but the user might very
well pass this in upper case.

We could use a standard key comparator / hash here, but since the
strings tend to be small, the new sstring will likely be allocated in
the stack here and this approach yields significantly less code.

Fixes #1791.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <51df92451e6e0a6325a005c19c95eaa55270da61.1477594199.git.glauber@scylladb.com>
2016-10-27 22:08:57 +03:00
Raphael S. Carvalho
d11e839520 db: make refresh resilient to permission denied error
User may forget to set permission of new sstables in upload dir
before refreshing them, and that will result in shutdown.
io_checker is now able to work with a custom handler, so all we
have to do is to whitelist EACCES.

Fixes #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 16:50:40 -02:00
Raphael S. Carvalho
a3e065da9b db: make it possible to use custom error handler with io checker
By default, io checker will cause Scylla to shutdown if it finds
specific system errors. Right now, io checker isn't flexible
enough to allow a specialized handler. For example, we don't want
to Scylla to shutdown if there's an permission problem when
uploading new files from upload dir. This desired flexibility is
made possible here by allowing a handler parameter to io check
functions and also changing existing code to take advantage of it.
That's a step towards fixing #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 15:54:21 -02:00
Takuya ASADA
a1b7e76d43 dist/ubuntu: support 16.10
Add 16.10 to 'supported_release'

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477585454-2115-1-git-send-email-syuu@scylladb.com>
2016-10-27 19:26:14 +03:00
Takuya ASADA
36e831a106 dist/common/scripts/scylla_bootparam_setup: support EC2 paravirtual instances
EC2 paravirtual instances uses pv-grub, which refers /boot/grub/menu.lst (grub0.9x config file) instead of grub2 config file.
So add boot parameters on /boot/grub/menu.lst when the file exists, and the instance is on EC2.

Fixes #1598

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472056875-17512-1-git-send-email-syuu@scylladb.com>
2016-10-27 18:55:05 +03:00
Avi Kivity
402a3f1c9f Merge seastar upstream
* seastar 9bed76a...05f6c5c (5):
  > reactor: improve task quota timer resolution
  > Update dpdk submodule to local-patches-20161027 tag
  > tests: wire up json_formatter_test
  > json_formatter_test: Add rudimentary json formatter test
  > scripts/posix_net_conf.sh: detect IRQs of virtio-net and xen_netfront correctly
2016-10-27 18:19:40 +03:00
Avi Kivity
e995f5a3a7 dist: statically link with boost on RHEL
Reduces runtime dependencies on Scylla-provided third-party boost packages.

Message-Id: <1477552490-28961-1-git-send-email-avi@scylladb.com>
2016-10-27 12:35:12 +03:00
Avi Kivity
76628a7b0b dist: make wget quieter
wget is often used from scripts recording to logs; as it emits a log
line every second, the logs are huge and unreadable.  Make it quieter.

Message-Id: <1477558534-32718-1-git-send-email-avi@scylladb.com>
2016-10-27 12:11:26 +03:00
Avi Kivity
72d78ffa7e Merge "Cache fixes" from Paweł
"5ff699e09fcbd62611e78b9de601f6c8636ab2f0 ("row_cache: rework cache to
use fast forwarding reader") brought some significant changes to the
row cache implementation. Unfortunately, "significant changes" often
translates to "more bugs" and this time was no different.

This series contains fixes for the problems introduced in that rework
and makes failing dtest
bootstrap_test.py:TestBootstrap.local_quorum_bootstrap_test
pass again."

* 'pdziepak/cache-fixes/v1' of github.com:cloudius-systems/seastar-dev:
  row_cache: avoid dereferencing invalid iterator
  row_cache: set _first_element flag correctly
  row_cache: fix clearing continuity flag at eviction
2016-10-27 11:44:15 +03:00
Takuya ASADA
5cb7dc5dc3 dist/ubuntu/dep: update thrift to 0.9.3
To make thrift compilable on gcc-6.2, we need to upgrade latest version of
thrift.
This is required to support Ubuntu 16.10.

Fixes #1784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477517671-18067-1-git-send-email-syuu@scylladb.com>
2016-10-27 10:22:06 +03:00
Paweł Dziepak
a7224ae46e row_cache: avoid dereferencing invalid iterator
Conditions in row_cache::do_find_or_create_entry() make it possible that
std::prev(it) is going to be dereferenced even if it is a begin
iterator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:24:23 +01:00
Paweł Dziepak
654f651e0c row_cache: set _first_element flag correctly
If the continuity flag was set for the first element _first_element flag
would not be cleared. This shouldn't cause any correctness problems but
properly setting the flag allows to avoid some unnecessary key
comparisons.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:07:24 +01:00
Paweł Dziepak
567ff96f2a row_cache: fix clearing continuity flag at eviction
In original implementation the continuity flag indicated that cache has
full information about the range the between current partition and the
one following it, hence when evicting an entry the one preceeding it
had to have its continuity flag cleared.

This was changed, however, and now the continuiy flag tells whether the
cache is continuous between the current element and the one before it.
This means that eviction code needs to clear the flag for the entry
directly following the evicted one.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 14:58:20 +01:00
Raphael S. Carvalho
bc2d351c25 sstables: remove duplicated declaration of remove_by_toc_name
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-26 11:21:27 -02:00
Takuya ASADA
7617adadf4 dist/ami/files/.bash_profile: fix confusing message when running AMI on unsupported instance type
To describe witch instance type is supported, show document URL instead of
confusing message.

Fixes #1646

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477473336-25373-1-git-send-email-syuu@scylladb.com>
2016-10-26 12:48:51 +03:00
Avi Kivity
7faf2eed2f build: support for linking statically with boost
Remove assumptions in the build system about dynamically linked boost unit
tests.  Includes seastar update which would have otherwise broken the
build.
2016-10-26 08:51:21 +03:00
Piotr Jastrzebski
27726cecff Clean up position_in_partition.
Introduce position_in_partition_view and use it in
position() method in mutation_fragment, range_tombstone,
static_row and clustering_row.
Clean up comparators in position_in_partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c65293c71a6aa23cf930ed317fb63df1fdc34fd1.1477399763.git.piotr@scylladb.com>
2016-10-25 15:13:20 +01:00
Tomasz Grabiec
cbaae2bf7f Merge seastar upstream
* seastar e18205b...3777135 (1):
  > rpc: Do not close client connection on error response for a timed out request

Fixes #1778
2016-10-25 13:59:41 +02:00
Raphael S. Carvalho
975ce62dbc sstables: do not swallow exception when reading TOC
That caused problem when refreshing a sstable with bad permissions.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <48e5322c53234209e55da05c64c99b8ec4e190a3.1477372974.git.raphaelsc@scylladb.com>
2016-10-25 12:21:32 +03:00
Avi Kivity
ddd4dbf928 Update scylla-ami submodule
* dist/ami/files/scylla-ami e1e3919...61ff5c6 (1):
  > scylla_ami_setup: run posix_net_conf.sh when NCPUS < 8
2016-10-25 11:18:58 +03:00
Avi Kivity
4b55a687b6 Merge seastar upstream
* seastar 98b5a2d...e18205b (1):
  > json::formatter: Add formatters for maps + rudimentary test
2016-10-25 11:17:29 +03:00
Avi Kivity
e8edaaf6a4 Merge seastar upstream
* seastar 69acec1...98b5a2d (9):
  > rpc: Silence warning about ignored failed future
  > future: prioritise continuations that can run immediately
  > iotune: relax aio restrictions
  > build: support for static linking with boost
  > rpc: Fix crash during connection teardown
  > rpc: Move _connected flag to protocol::connection
  > rpc test: fail test if exception is thrown during test execution
  > rpc: do not assume underling semaphore type
  > rpc: fix default resource limit
2016-10-25 11:09:40 +03:00
Avi Kivity
fc8210a875 tests: fix tests with boost 1.60
In boost 1.60, the executable's command-line arguments are expected to
be separated from the boost command-line arguments by '--'.  Detect
this requirement and comply with it.
Message-Id: <1477212424-3831-1-git-send-email-avi@scylladb.com>
2016-10-24 09:36:56 +02:00
Avi Kivity
37f112b610 dist: add python3-yaml to ununtu dependencies for blocktune 2016-10-23 16:42:13 +03:00
Avi Kivity
7d50d6df9b blocktune: fix syntax error in exception handling 2016-10-23 16:40:00 +03:00
Avi Kivity
e261a380a9 dist: add PyYAML dependency to rpm (for blocktune) 2016-10-23 10:36:29 +03:00
Raphael S. Carvalho
fa308c079c database: fix collectd metrics for clustering key filter
Same instance name was used for exported metrics, which is
definitely wrong. Checked it works properly now via collectd
exporter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>
2016-10-22 09:51:18 +03:00
Glauber Costa
a13c410749 commitlog: cycle based on total size, not on mutation size
We calculate two sizes during the allocation: "size", which is the
in-segment size of this mutation, and "s", which is that plus the
overhead. cycle() must be called with the latter, not the former, as
doing otherwise may lead to buffer overflows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>
2016-10-21 18:57:41 +03:00
Glauber Costa
d9875784a1 commitlog: do not wait on pending operations for batch mode
This was explicitly mentioned in my set as gone in one of the versions.
Somehow it came back in the final version - sorry about that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2a0eba28cd74267d1a1fdcf1aef2901cc74ffc9f.1477059963.git.glauber@scylladb.com>
2016-10-21 17:27:16 +03:00
Vlad Zolotarov
f75a350a8f service::storage_proxy: use global_trace_state_ptr when using invoke_on
When trace_state may migrate to a different shard a global_trace_state_ptr
has to be used.

This patch completes the patch below:

commit 7e180c7bd3
Author: Vlad Zolotarov <vladz@cloudius-systems.com>
Date:   Tue Sep 20 19:09:27 2016 +0300

    tracing: introduce the tracing::global_trace_state_ptr class

Fixes #1770

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1476993537-27388-1-git-send-email-vladz@cloudius-systems.com>
2016-10-21 11:34:13 +03:00
Avi Kivity
e3ae54f0fe Merge "Rework commitlog to avoid timeouts" from Glauber
"This patchset reworks the commitlog logic to better handle conditions in
which we are getting requests faster than the disk can handle. It does
this by building a wall around the commitlog and only allowing
allocations to proceed when we are under the desired memory threshold.

The main advantage of that is that we can now easily set the commitlog
to work at disk speed, more or less allowing an "one byte in for each
byte out" approach instead of depending on the current cycle to finish.
As a result, max latencies are greatly reduced.

Testing Results
===============

To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

I have run this with larger sizes as well, and it generally performs
much better than the baseline version. For sizes up to 5MB, I have seen
no timeouts in my setup. After that, I see some timeouts. Buffer
splitting is expected to make this better.

Aside from performance testing, this was also tested with batch and
periodic mode for various requests sizes."
2016-10-20 16:44:39 +03:00
Glauber Costa
d5618c6ace commitlog: add total_operations type for requests_blocked_memory
Current tracker for pending allocations is a queue_size GAUGE.  Add a
total_operations version so we have more insight on what's going on.

It will be called requests_blocked_memory for consistency with other
subsystems that track similar things.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-20 09:25:38 -04:00
Avi Kivity
db2f5e6be1 blocktune: wire up blocktune on startup
Message-Id: <1476357027-15014-3-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
098d02ad1a scylla-blocktune: introduce
scylla-blocktune is a script that parses scylla.yaml and tunes the data file
and commitlog directories it references.

Tuning includes:
 - set the I/O scheduler to noop
 - disable merging
 - tune dependent devices (like RAID members)

Message-Id: <1476357027-15014-2-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
fad34eef6c scylla_raid_setup: don't mess with read-ahead
It doesn't affect O_DIRECT reads, and it's not persistent.

Message-Id: <1476269082-2473-2-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Avi Kivity
a837da06ef scylla_raid_setup: increase chunk size
The current chunk size of 256 gives a 50% probability of a 128k read or
write getting split into two accesses.  This reduces efficiency and
increases latency.

Change the chunk size to 1MB, with a 12% probability of cross-member
access.

Message-Id: <1476269082-2473-1-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Takuya ASADA
80e3d8286c dist/ami: fix incorrect /etc/fstab entry on CentOS7 base image
There was incorrect rootfs entry on /etc/fstab:
 /dev/sda1 / xfs defaults,noatime 1 1
This causes boot error when updated to new kernel.
(see:
https://github.com/scylladb/scylla/issues/1597#issuecomment-250243187)

So replaced the entry to
 UUID=<uuid>  / xfs defaults,noatime 1 1
Also all recent security updates applied.

Fixes #1597
Fixes #1707

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475094957-9464-1-git-send-email-syuu@scylladb.com>
2016-10-20 11:48:24 +03:00
Takuya ASADA
5f602752a5 dist/ubuntu: backport g++-5 from Debian 9(stretch) to Debian 8(jessie)
Since Debian 8(jessie) does not provides g++-5, we frequently got compile error
because we are using older compiler.
To fix the problem, backport g++-5 from Debian 9(stretch).

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-3-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Takuya ASADA
7d67504b56 dist/ubuntu: use VERSION_ID from /etc/os-release instead of 'lsb_release -r'
On Debian, lsb_release -r returns the version number something like '8.6'.
However, on this script we want to check major version only.
Therefore we can use VERSION_ID from /etc/os-release which only contains
major version number.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-2-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Avi Kivity
0da2f64cfb Merge seastsar upstream
* seastar ccd8649...69acec1 (2):
  > app/iotune: add --smp option
  > rpc: Add missing adjustment of snd_buf::size

Fixes #1767.
Fixes #1768.
2016-10-20 11:16:40 +03:00
Paweł Dziepak
210a390892 tests: add missing sstable for partition skipping test
Commit 7dcd70124a "tests/sstables: add
test for fast forwarding reader" added a test for skipping parts of
sstable. Unfortunately, it did not include the sstables it was trying to
read.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 23:23:49 +01:00
Glauber Costa
1578d7363a commitlog: rework blocking logic
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.

That is obviously something we must do, but the current approach to it
is problematic for two main reasons:

1) It forces the requests that trigger a write to wait on the current
   write to finish. That is excessive; ideally we would wait for one
   particular write to finish, not necessarily the current one. That
   is made worse by the fact that when a write is followed by a flush
   (happens when we move to a new segment), then we must wait for
   *all* writes in that segment to finish.

1) it casts concurrency in terms of writes instead of memory, which
   makes the aforementioned problem a lot worse: if we have very big
   buffers in flight and we must wait for them to finish, that can
   take a long time, often in the order of seconds, causing timeouts.

The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.

This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.

To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:56:36 -04:00
Glauber Costa
aec724bbda commitlog: factor out code for checking mutation size
In a subsequent patch, I'll use this code in a different place. To
prepare for that, we move it out as a method. It also fits a lot better
inside the segment manager, so move it there.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
a50996f376 commitlog: calculate segment-independent size of mutations
Goal is to calculate a size that is lesser or equal than the
segment-dependent size.

This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"

Extracted here so it sits clearly in a different patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
0b7c9fa17f commitlog: remove _needed_size
It is mostly an optimization, and while it makes sense in this context,
it won't soon as we'll stop waiting for the current cycle specifically
to finish.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
6214bdeb66 commitlog: move segment_manager constructor outside the class definition
We'll do that so we can, in following patches, use static members from
the segment. Those are not defined at this point.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
299877f432 commitlog: add a counter for pending allocations
We track the amount of pending allocations but we don't really export
it. It will be crucial when we stop tracking pending writes.

This patch exports it through a method instead of the totals structure,
so we can easily change it. Current code probing pending_allocations
(the api code) is also converted to use the public method instead of the
totals struct.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Avi Kivity
07c995ab3d Merge "Fast forward mutation readers" from Paweł
"This patchset enables mutation readers to be fast forwarded to a different
partition range. The main reason for introducing such feature are range
queries served from cache. If the cache is partially populated in the
requested range the reader will end up with multiple subranges that have
to be read from the sstables. Originally, each of these subranges would
require a new reader to be created, but with fast forwarding we can have
just one sstable reader. This is better since there is a chance that buffers
kept by the reader may be still useful after fast forwarding it.

In this series there are also patches that clean up cache readers in order
to make integration with fast forwarding easier. Namely, continuity flag is
changed to store information about range before the entry which significantly
simplifies the logic.

Fixes #1299."

* 'pdziepak/fast-forward-mutation-readers/v5' of github.com:cloudius-systems/seastar-dev: (24 commits)
  sstables: keep separate stream history for single and range reads
  sstables: drop sstable::{lower, upper}_bound()
  row_cache: rework cache to use fast forwarding reader
  row_cache: put cache entry flags in a struct
  row_cache: add do_find_or_create_entry() to reduce code duplication
  mutation_reader: forward fast_forward_to() calls
  tests/row_cache: add fast_forward_to() to throttled reader
  tests/row_cache: count mutations read from _underlying
  memtable: add support for fast_forward_to()
  drop key readers
  tests/mutation_reader: test fast forwarding combined reader
  database: enable fast forwarding of range_sstable_reader
  combined_mutation_reader: implement fast_forward_to()
  mutation_reader: make combinded_reader public
  tests/sstables: add test for fast forwarding reader
  tests: add more helpers to mutation reader assertions
  sstables: enable fast forwarding for range readers
  mutation_reader: introduce fast_forward_to()
  sstables: implement mutation_reader::impl::fast_forward_to()
  sstables: introduce index_reader
  ...
2016-10-19 18:10:44 +03:00
Paweł Dziepak
ab0eeae82d sstables: keep separate stream history for single and range reads
Single partition and partition range reads are expected to behave
considerably different so it is worth to have them use separate file
stream history. This also makes reads use different history for each
sstable which is also a good thing.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
20bfa1fa52 sstables: drop sstable::{lower, upper}_bound()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ff699e09f row_cache: rework cache to use fast forwarding reader
This uncomfortably large patch overhauls cache range reader so that it
can take advantage of fast forwarding mutation readers.

A significant change in the cache itself is that the continuity flag now
is used to determine whether cache is contiguous between the previous
entry and the current one. This allows for a significant simplification
of the cache code and easier integration with reader fast forwarding.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
18acb0c0e6 row_cache: put cache entry flags in a struct
Flags are easier to manage if they are in a single structure.
Especially, default initialization and move contstructors are simpler
and less error prone.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f248e23db5 row_cache: add do_find_or_create_entry() to reduce code duplication
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
bcd374c05d mutation_reader: forward fast_forward_to() calls
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0c24bbe639 tests/row_cache: add fast_forward_to() to throttled reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
69645455f3 tests/row_cache: count mutations read from _underlying
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether
continuity flag is working correctly.

This is not going to work with fast forwarding mutation readers so the
test is switched to count number of mutations (+ end of stream markers)
returned from underlying mutaiton readers which is much less fragile.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
e14f8027d5 memtable: add support for fast_forward_to()
Fast forwarding of memtable readers is needed only for unit tests which
often use memtables as underlying data source for cache and the cache
readers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers weren't used since introduction of continuity flag to cache
entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ac9babe97 tests/mutation_reader: test fast forwarding combined reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7bebfb851f database: enable fast forwarding of range_sstable_reader
When fast forwarding a reader that combines sstable reader we must also
remember that the set of sstables for the new range may be different
than for the previous one. The reader introduced in this patch makes
sure that we read from correct sstables.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
b7b7b2bd63 combined_mutation_reader: implement fast_forward_to()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2c0cdd55fc mutation_reader: make combinded_reader public
We want to be able to fast forward sstable readers. However, just
implementing fast_forward_to() for combined_reader is not enough as the
sstables we are reading from may need to change.

Following patches are going to introduce a combined sstable reader that
derives from combined_reader. To make that possible we first need to
make combined_reader public.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7dcd70124a tests/sstables: add test for fast forwarding reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5534dc2817 tests: add more helpers to mutation reader assertions
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
cf024975fe sstables: enable fast forwarding for range readers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
62c9492d33 mutation_reader: introduce fast_forward_to()
This patch introduces the interface for fast forwarding mutation
readers. The main user of this feature is going to be cache which, while
serving range query, may need to read multiple small ranges from the
sstables to populate itself with the missing entries.

Fast forwarding is an alternative to recreating a reader with different
range. Its main advantage is fact that it avoids dropping data that has
already been read.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
c63e88d556 sstables: implement mutation_reader::impl::fast_forward_to()
This patch allows sstable readers to be fast forwarded without making it
necessary to recreate the reader (and dropping all buffers in the
process). It is built on top of index_reader and ability of
data_consume_context to be fast forwarded.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
a530762277 sstables: introduce index_reader
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they still may be needed (for example to
get end bound of the range or after fast forwarding the reader).

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f49a9e0d64 sstables: drop unused read_range_rows() overload
That overload was used only by unit test and violated guarantee that
partition range lives until mutation reader is done.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0bc873ace5 sstables: add fast_forward_to() to continuous_data_consumer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
25b91c51e2 ssables: add data_consume_rows_context::reset()
reset() is going to be used to restore valid state after fast forwarding
the reader.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2124d08b88 sstables: add skip() to compressed_file_data_source
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
54069162f5 Merge "Add test for partition version list consistency after compaction" from Tomek 2016-10-18 11:03:25 +01:00
Tomasz Grabiec
308434f891 tests: memtable: Add test for partition version list consistency after compaction 2016-10-18 11:57:14 +02:00
Tomasz Grabiec
6548132423 lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
ecf85cbffb mutation: Define + operation
It's more convenient to write m1 + m2 in tests than to do more
elaborate constructs with copy constructors and apply().
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
fe387f8ba0 partition_version: Fix corruption of partition_version list
The move constructor of partition_version was not invoking move
constructor of anchorless_list_base_hook. As a result, when
partition_version objects were moved, e.g. during LSA compaction, they
were unlinked from their lists.

This can make readers return invalid data, because not all versions
will be reachable.

It also casues leaks of the versions which are not directly attached
to memtable entry. This will trigger assertion failure in LSA region
destructor. This assetion triggers with row cache disabled. With cache
enabled (default) all segments are merged into the cache region, which
currently is not destroyed on shutdown, so this problem would go
unnoticed. With cache disabled, memtable region is destroyed after
memtable is flushed and after all readers stop using that memtable.

Fixes #1753.
Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>
2016-10-18 09:25:38 +01:00
Duarte Nunes
1d45f19c78 create_view_statement: Use cf_properties
This patch uses cf_properties instead to add the missing attributes to
the create_view_statement class.

Fixes #1766

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c58b7e764 unimplemented: Add materialized views
This patch adds the VIEWS element to the cause enum so we can
mark failures due to incomplete support of materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c28ed3dfc schema: Extract default compressor
This patch extracts the definition of the default compressor into the
compression_parameters class, so that the table and view creation
statements don't have to explicitly deal with it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
dc470e6a36 cql3: Extract cf_properties
This patch extracts the cf_properties class, which contains common
attributes of tables and materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:51 +00:00
Takuya ASADA
587d375e19 main: exit with 1 when verify_seastar_io_scheduler() failed
Since we are exiting Scylla process in engine().at_exit() using
::_exit(0), even verify_seastar_io_scheduler() throwing an exception,
scylla always exit with 0.

Systemd misunderstands scylla-server.service was shutdown successfully
because of this, so we need to pass correct exit code to ::_exit() here.

Fixes #1674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>
2016-10-17 13:57:00 +03:00
Avi Kivity
163088c6af Merge seastar upstream
* seastar 207bf3d...ccd8649 (3):
  > Merge "Augment semaphore with non-blocking operations" from Glauber
  > Merge "More dynamic fstream patches" from Paweł
  > Merge "fstream: add dynamic adjustments based on stream history" from Paweł
2016-10-17 12:49:17 +03:00
Avi Kivity
65c27ccf21 bytes_ostream: make max_chunk_size() an inline function
Fixes debug build looking for a variable definition and not finding it.
2016-10-17 11:49:33 +03:00
Avi Kivity
c0a1ad0b77 bytes_ostream: use larger allocations
A 1MB response will require 2000 allocations with the current 512-byte
chunk size.  Increase it exponentially to reduce allocation count for
larger responses (still respecting the upper limit).
Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>
2016-10-16 10:05:48 +01:00
Tomasz Grabiec
d836e8f64b tests: memtable: Add tests for flushing reader
Message-Id: <1476454187-11462-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:11:06 +01:00
Tomasz Grabiec
63784fd921 db: Fix corruption of partition_entry
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, it is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.

Change the accounting code to reuse existing partition_snapshot reference.

Fixes #1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:10:48 +01:00
Paweł Dziepak
d08cffd3c7 lsa: avoid exceptions during segment_zone creation
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.

This patch changes the LSA code so that it doesn't throw in case zone
couldn't be created but just returns a null pointer which should be
more performant if the LSA memory cannot grow any more.

Fixes #1394.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
2016-10-14 11:08:24 +02:00
Amnon Heiman
7829da13b4 scylla_setup: Reorder questions and actions
The expected behaviour in the scylla_setup script is that a question
will be followed by the answer.

For example, after asking if the scylla should be run as a service the
relevant actions will be taken before the following question.

This patch address two such mis-orders:
1. the scylla-housekeeping depends on the scylla-server, but the
setup should first setup the scylla-server service and only then ask
(and install if needed) the scylla-housekeeping.
2. The node_exporter should be placed after the io_setup is done.

Fixes #1739

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>
2016-10-13 18:29:36 +03:00
Pekka Enberg
3b4e6cdc5e abstract_replication_strategy: Fix exception type if class not found
Change abstract_replication_strategy::create_replication_strategy() to
throw exceptions::configuration_error if replication strategy class
lookup to make sure the error is converted to the correct CQL response.

Fixes #1755

Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>
2016-10-13 17:39:28 +03:00
Tomasz Grabiec
e617bcd8a7 logalloc: disable abort on allocation failure in places in which it is benign
Some places start big expecting allocation failure, then reduce the
requested size. Let's not abort in such cases.

Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>
2016-10-13 10:53:32 +03:00
Avi Kivity
13e9d4c8e3 Merge seastar upstream
* seastar f937fb0...207bf3d (11):
  > Merge "iotune: gracefully exit on predictable exceptions" (Fixes #1623)
  > core/semaphore: Add semaphore_units::release()
  > Merge "rometheus API with grafana uses labels" from Amnon
  > core/thread: Fix stack alloc-dealloc mismatch
  > core/thread: Make jmp_buf_link::yield_at use the same time point as thread_scheduling_group
  > file: support for XFS on older kernels
  > reactor: fix bug when handling EBADF in flush_pending_aio()
  > prometheus CPU should start in 0
  > Collectd: bytes ordering depends on the type
  > tests: Check that backtrace() doesn't corrupt signal mask
  > core/thread: Add stack guards to seastar thread stacks
2016-10-12 23:47:12 +03:00
Avi Kivity
63f053e9b7 storage_proxy: fix mutation reordering with wrapping ranges
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.

Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.

Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.

Fixes #1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
2016-10-12 15:59:16 +02:00
Avi Kivity
1506b06617 Merge "node_exporter service on ubuntu 16" from Amnon
"This series address two issues that interfere with running the node_exporter as a service in ubuntu 16.
1. The service file should be packed in the deb file
2. When setting the node_exporter as a service it doesn't need to run with scylla use"

* 'amnon/node_exporter_ubuntu_v2' of github.com:cloudius-systems/seastar-dev:
  node-exporter service: No need to run as scylla user
  debian package: Include the node_exporter service file
2016-10-12 12:11:18 +03:00
Amnon Heiman
1bd50789e0 node-exporter service: No need to run as scylla user
the node-exporter does not need to run as scylla user.  It can run
without scylla or without the scylla user being configure.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:27 +03:00
Amnon Heiman
d523bf56ed debian package: Include the node_exporter service file
This will include the node_exporter service script for ubuntu
distribution with systemd support.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:14 +03:00
Avi Kivity
f6998bb260 Merge "Implement describe_splits_ex based on Cassandra" from Duarte
"This patch-set re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693"

* 'describe-splits/v2' of github.com:duarten/scylla:
  thrift: Implement describe_splits_ex based on Cassandra
  storage_service: Implement get_splits() function
  sstables: Add function to get key samples
  sstables/key: Add to_partition_key function
  size_estimates_recorder: Increase estimate accuracy
  sstables: Get estimates for a particular range
  sstables/key: Make key::kind public
2016-10-11 11:13:35 +03:00
Takuya ASADA
0007f2d838 dist/common/sbin: add scylla_cpuset_setup and scylla_dev_mode_setup to /usr/sbin
We haven't added symlinks to /usr/sbin for newly created scripts, so add them.

Fixes #1702

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474879711-31793-1-git-send-email-syuu@scylladb.com>
2016-10-11 11:02:14 +03:00
Takuya ASADA
ccad720bb1 dist/common/script/scylla_io_setup: handle comma correctly when parsing cpuset
The script mistakenly split value at "," when cpuset list is separated
by comma. Instead of matching possible patterns of the argument, let's
pass all characters until reach to space delimiter or end of line.

Fixes #1716

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476171037-32373-1-git-send-email-syuu@scylladb.com>
2016-10-11 10:42:32 +03:00
Duarte Nunes
d8cfc56376 thrift: Implement describe_splits_ex based on Cassandra
This patch re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:10 +02:00
Duarte Nunes
01ab2081cd storage_service: Implement get_splits() function
This patch implements the get_splits() function in storage_service,
used to split a particular token range in slices of approximately the
specified size, using the sample keys and estimates of the CF's
sstables.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:08 +02:00
Duarte Nunes
c36dbaf0f1 sstables: Add function to get key samples
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:14 +02:00
Duarte Nunes
fc07b66678 sstables/key: Add to_partition_key function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:11 +02:00
Duarte Nunes
c19c633299 size_estimates_recorder: Increase estimate accuracy
This patch uses the estimated_keys_for_range() function to get better
estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:16 +02:00
Duarte Nunes
ceed09b23e sstables: Get estimates for a particular range
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present between the specified range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:15 +02:00
Duarte Nunes
8c223b31c8 sstables/key: Make key::kind public
Needed to create synthetic keys without any value but with ordering
properties.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:47:24 +02:00
Avi Kivity
b305d92a65 Merge "housekeeping: check version during setup" from Amnon
"The version is taken from the installation rather than the API, a mode command
line indicated that this is part of the setup and uuid is used for the
interaction with the checkversion server."

* 'amnon/check_version_on_startup_v3' of github.com:cloudius-systems/seastar-dev:
  scylla_setup: Check and report the scylla version
  scylla-housekeeping: check version during setup
2016-10-10 16:37:14 +03:00
Vlad Zolotarov
ab748e829d docs: tracing.md: initial commit
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475686745-20383-1-git-send-email-vladz@cloudius-systems.com>
2016-10-10 16:12:02 +03:00
Tomasz Grabiec
4357d0a6d9 db: Add counter for writes blocked on dirty memory
There is already queue_length-requests_blocked_memory, but it's a
gauge so does not reflect what happened between the sampling points.

total_operations-requests_blocked_memory will allow to see if there
were any (and how many) requests which were blocked by dirty memory.

Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
2016-10-10 14:25:22 +03:00
Pekka Enberg
3b75ff1496 docs/docker: Tag --listen-address as 1.4 feature
The Docker Hub documentation is the same for all image versions. Tag
`--listen-address` as 1.4 feature.

Message-Id: <1475819164-7865-1-git-send-email-penberg@scylladb.com>
2016-10-10 13:26:16 +03:00
Vlad Zolotarov
006999f46c api::storage_service::slow_query: don't use duration_cast in GET
The slow_query_record_ttl() and slow_query_threshold() return the duration
of the appropriate type already - no need for an additional cast.
In addition there was a mistake in a cast of ttl.

Fixes #1734

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475669400-5925-1-git-send-email-vladz@cloudius-systems.com>
2016-10-09 18:09:13 +03:00
Takuya ASADA
469e9af1f4 dist/common/scripts/scylla_setup: use 'swapon -s' instead of 'swapon --show'
Since Ubuntu 14.04 doesn't supported --show option, we need to prevent use it.
Fixes #1740

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-2-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Takuya ASADA
8452045b85 dist/ubuntu: add realpath to dependency, requires for scylla_setup
We need dependency to realpath, since scylla_setup using it.

Fixes #1740.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-1-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Tomasz Grabiec
41e66ebce2 gdb: Introduce 'scylla heapprof'
Presents current heap profile recording.

Works in text mode or dumps to collapsed stacks format from which
flame graph can be generated.

To generate a flamegraph:

  (gdb) scylla heapprof --flame
  Wrote heapprof.stacks

  $ flamegraph.pl --colors mem < heapprof.stacks > heapprof.svg

flamegraph.pl comes from:

  https://github.com/brendangregg/FlameGraph.git

Text mode example:

(gdb) scylla heapprof --min 100000000
All (274699676, #10213)
 \-- void* memory::cpu_pages::allocate_large_and_trim<memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}>(unsigned int, memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}) + 169  (268435456, #1)
     memory::allocate_large_aligned(unsigned long, unsigned long) + 87
     memory::allocate_aligned(unsigned long, unsigned long) + 48
     aligned_alloc + 9
     logalloc::segment_zone::segment_zone() + 304
     logalloc::segment_pool::allocate_segment() + 477
     logalloc::segment_pool::segment_pool() + 304
     __tls_init.part.801 + 72
     logalloc::region_group::release_requests() + 1333
     logalloc::region_group::add(logalloc::region_group*) + 514

The branches are formatted like this:

   -- <symbol> (<size>, #<count>)

Where <size> is total size of live objects and <count> is total
number of live objects, for all objects allocated from paths going
through this node.

Nodes which share the same <size> and <count> are stacked like this:

   -- <symbol_1> (<size>, #<count>)
      <symbol_2>
      <symbol_3>

Message-Id: <1475583334-19524-1-git-send-email-tgrabiec@scylladb.com>
2016-10-09 10:54:08 +03:00
Glauber Costa
33e9c2bbdd memtable: reduce sstable flush concurrency to one
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write behind support. Now
that write behind is properly merged we can reduce the concurrency to
what it should be, one.

This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.

Fixes #1373

Signed-off-by: Glauber Costa <glauber@scylladb.com>

Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
2016-10-09 10:48:57 +03:00
Tomasz Grabiec
2a5a90f391 db: Do not timeout streaming readers
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.

This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.

The change sets no timeout for streaming readers at replica level,
similarly as we do for system tables readers.

Fixes #1741.

Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
2016-10-07 15:41:04 +03:00
Raphael S. Carvalho
9175977a9d cql3: fix build failure by defining out unused function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <cba6207278ea945ee750d78b189320443843a288.1475793747.git.raphaelsc@scylladb.com>
2016-10-07 08:45:18 +03:00
Avi Kivity
9ac441d3b5 range: adjust split_after to allow split_point outside input range
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range.  If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.

"before" and "after" are not well defined for wrap-around ranges, so
but we are phasing them out and soon there will not be
wrapping_range::split_after() users.

This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
2016-10-06 17:54:44 +02:00
Raphael S. Carvalho
7ea4513595 database: trigger compaction after loading new sstables
Scylla wasn't trying to compact new sstables uploaded via 'nodetool
refresh'. Thus, all new sstables were left uncompacted until user
issued 'nodetool flush' or a new sstable was written which would
trigger compaction too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:49 +03:00
Raphael S. Carvalho
9c59ccc52a storage_service: improve log message for refresh
'No new SSTables were found for keyspace1.standard1' was printed
if user uploaded new sstables to upload dir instead, and that is
confusing. We should instead print that if new sstables weren't
found in both cf and cf/upload dirs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <90386f6255407697434213227ae7ff0de7464f99.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:32 +03:00
Raphael S. Carvalho
76862d0d9c main: start compaction procedure after commit log is replayed
Commit log replay is a synchronous operation in bootstrap, so services
will only be started after it's completed. By starting compaction before,
less bandwidth will be available to both and consequently boot will be
slowed down. Fix is simply about moving compaction, which is an
asynchronous operation after commitlog replay is over.

Fixes #1620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>
2016-10-06 18:25:24 +03:00
Nadav Har'El
ee7ec10b11 CQL parser: "CREATE MATERIALIZED VIEW" statement
This patch adds the parsing for the "CREATE MATERIALIZED VIEW" statement,
following Cassandra 3 syntax. For example:

   CREATE MATERIALIZED VIEW building_by_city
   AS SELECT * FROM buildings
   WHERE city IS NOT NULL
   PRIMARY KEY(city, name);

It also adds the "IS NOT NULL" operator needed for this purpose.
As in Cassandra, "IS NOT NULL" can only be used for materialized
view creation, and not in a normal SELECT. It can only be used with
the NULL operand (i.e., "IS NOT 3" will be a syntax error).

The current implementation of this statement just does some sanity
checking (such as to verify that "city" is a valid column name and that
the "building" base table exists), complains that materialized views are
not yet supported:

SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [CREATE MATERIALIZED VIEW building_by_city AS
SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);] reason: unsupported operation: Materialized views not yet supported">

As mentioned above, the "IS NOT NULL" restriction is not allowed in
ordinary selects not creating a materialized views:

SELECT * FROM buildings WHERE city IS NOT NULL;
InvalidRequest: code=2200 [Invalid query] message="restriction 'city IS NOT null' is only supported in materialized view creation"

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1475742927-30695-1-git-send-email-nyh@scylladb.com>
2016-10-06 15:42:37 +03:00
Glauber Costa
7146776d7c fix sstable tests by not using the flush_reader if no region_group
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to it.

Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.

If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.

One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.

The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
2016-10-05 12:44:21 +01:00
Avi Kivity
c94fb1bf12 build: reduce inclusions of messaging_service.hh
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.

Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
2016-10-05 11:46:49 +03:00
Avi Kivity
f8118d9fc2 Merge "Virtual dirty memory management" from Glauber
"Description:
============

Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that, is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Results
=======

With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.

In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.

For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ

Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed.  I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have ran (large, 65k mutations). Memtables under this circumnstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.

Known Issues:
-------------

A lot of time can be elapsed between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.

After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."

* 'virtual-dirty-v6' of github.com:glommer/scylla:
  database: allow virtual dirty memory management
  streamed_mutation: make _buffer private
  add accounting of memory read to partition_snapshot_reader
  move partition_snapshot_reader code to header file
  LSA: allow a group to query its own region group
  memtables: split scanning reader in two
  sstables: use special reader for writing a memtable
  LSA: export information about object memory footprint
  LSA: export information about size of the throttle queue
  database: export virtual dirty bytes region group
2016-10-04 20:57:52 +03:00
Avi Kivity
cc33c8b4ba Merge seastar upstream
* seastar 18f7bb8...f937fb0 (5):
  > Merge "Fix signal mask corruption" from Tomasz
  > core/memory: Avoid violating strict aliasing when accessing allocation sites
  > core/memory: Avoid indirection when storing allocation sites
  > core/memory: Add a way to disable abort on allocation failure in some scope
  > core/sharded: Allow mapper to take the service by non-const reference
2016-10-04 20:08:57 +03:00
Glauber Costa
f89a67c75c database: allow virtual dirty memory management
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
7b6e8a2526 streamed_mutation: make _buffer private
It is currently protected, but now all users go through
push_mutation_fragment().  So we can safely move its visibility to guarantee
that it stays that way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
1db245b52d add accounting of memory read to partition_snapshot_reader
By default, we don't do any accounting. By specializing this class and providing
an accounter class, we can account how much memory are we reading as we read
through the elements.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
452eb95943 move partition_snapshot_reader code to header file
This is so we can template it without worrying about declaring the
specializations in the .cc file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
86aa0b830d LSA: allow a group to query its own region group
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
eee15578fb memtables: split scanning reader in two
The code that is common will live in its own reader, the iterator_reader.  All
friendly private access to memtable attributes and methods happen through the
iterator reader.

After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
16886eeb96 sstables: use special reader for writing a memtable
Right now the special reader doesn't do much, but the idea is that we will
soon replace it will a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
28e3f2f6ee LSA: export information about object memory footprint
We allocate objects of a certain size, but we use a bit more memory to hold
them.  To get a clerer picture about how much memory will an object cost us, we
need help from the allocator. This patch exports an interface that allow users
to query into a specific allocator to get that information.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Pekka Enberg
c3bebea1ef dist/docker: Add '--listen-address' to 'docker run'
Add a '--listen-address' command line parameter to the Docker image,
which can be used to set Scylla's listen address.

Refs #1723

Message-Id: <1475485165-6772-1-git-send-email-penberg@scylladb.com>
2016-10-04 13:57:55 +03:00
Marius
876775a52c dist/docker/ubuntu: refactored $IP/listen_address
In order to allow Scylla’s docker container to handle multiple network
interfaces, the start-scylla script was refactored:

- `$IP` is now called `$SCYLLA_LISTEN_ADDRESS`, so it is less likely to
   be confused or interfere with other environment variables.
- `$SCYLLA_LISTEN_ADDRESS` now checks its value and also tries to
   resolve a hostname, if no IP was set to it.
- `$SCYLLA_LISTEN_DEVICE` can now be set as environment variable and
   contain any available NIC device name (e.g. `eth0`). The script
   automatically retrieves the IP address from the device.

Usage:

1. With `$SCYLLA_LISTEN_ADDRESS` as IP:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=192.168.1.100 scylladb/scylla`

2. With `$SCYLLA_LISTEN_ADDRESS` as hostname:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=containername.network.lan scylladb/scylla`

3. With `$SCYLLA_LISTEN_DEVICE`:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_DEVICE=eth0 scylladb/scylla`

Message-Id: <20161003151230.67672-1-marius@twostairs.com>
2016-10-04 13:56:55 +03:00
Raphael S. Carvalho
747b42299c database: remove unused code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>
2016-10-04 09:26:43 +03:00
Paweł Dziepak
7599ef6fde query_pager: fix splitting range at the end bound
Currently, the code responsible for calculating ranges for the next
request could produce a wrap-around partition range. For example, if the
original range was (unimportant, A] and the last partition key A then
the output range would be (A, A].

This patch adds checks to make sure that in such cases the range is
removed.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475497244-2790-1-git-send-email-pdziepak@scylladb.com>
2016-10-03 19:33:42 +02:00
Avi Kivity
8747054d10 exceptions: mark function called before construction static
cassandra_exception::prepare_message() is called from derived classes'
constructors before the base cassnadra_exception object is constructed.
This is technically illegal but harmless.  Fix by marking the function
static.

Found by clang.
2016-10-03 16:29:02 +03:00
Calle Wilund
5b815b81b4 auth::password_authenticator: Ensure exceptions are processed in continuation
Fixes #1718 (even more)
Message-Id: <1475497389-27016-1-git-send-email-calle@scylladb.com>
2016-10-03 14:49:59 +02:00
Pekka Enberg
f3cd21c8f1 Merge seastar upstream
* seastar 0e60722...18f7bb8 (1):
  > core/memory: Fix compilation errors
2016-10-03 12:54:38 +03:00
Calle Wilund
d24d0f8f90 auth::password_authenticator: "authenticate" should not throw undeclared excpt
Fixes #1718

Message-Id: <1475487331-25927-1-git-send-email-calle@scylladb.com>
2016-10-03 12:53:30 +03:00
Avi Kivity
a51804eca8 Merge "token_restriction: Deal with minimum tokens" from Duarte
"This patch set ensures we can correctly handle queries
where the minimum token is specified."

* 'min-token/v3' of github.com:duarten/scylla:
  cql_query_test: Add test case for min/max token bounds
  token_restriction: Deal with minimum tokens
  partitioner: Parse token from bytes
2016-10-02 12:32:40 +03:00
Avi Kivity
5071f4c0bf Merge seastar upstream
* seastar 9e1d5db...0e60722 (9):
  > core/memory: Replace assert with bad_alloc in allocate_large()
  > chunked_fifo: avoid direct use of sized operator delete
  > memory: fix build without heap profiler
  > xen: initialize port::_sem
  > Merge "Make input streams skippable" from Paweł
  > semaphore: require explict setting for start value
  > prometheus: remove invalid chars from meric names
  > core/memory: Introduce heap profiler
  > util/backtrace: Mark noexcept if func() doesn't throw
2016-10-02 11:43:22 +03:00
Vlad Zolotarov
7e180c7bd3 tracing: introduce the tracing::global_trace_state_ptr class
This object, similarly to a global_schema_ptr, allows to dynamically
create the trace_state_ptr objects on different shards in a context
of the original tracing session.

This object would create a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.

Fixes #1678
Fixes #1647

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
2016-10-02 11:31:37 +03:00
Amnon Heiman
a83bd900be scylla_setup: Check and report the scylla version
This patch adds a call to the scylla-housekeeping check version during
setup, so a warning will be printed if a newer version is available.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Amnon Heiman
5e3ab32365 scylla-housekeeping: check version during setup
This changes are for running scylla during setup.

It contains the following changes:
1. get the current version from the command line (as the syclla does not
run at this stage).
2. It support a mode parameter in the command line to indicate that we
running during the installation.
3. It accept an external uuid that will be used with all interaction
with the check_version server.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Takuya ASADA
15b156c9d4 dist/common/scripts/scylla_io_setup: describe how to set developer mode when validation tests failed
Describe how to set developer mode, not to confuse users.
Fixes #1701

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475167584-18092-1-git-send-email-syuu@scylladb.com>
2016-10-02 10:58:38 +03:00
Avi Kivity
58ddfea18f Merge "Fixes for leveled compaction strategy" from Raphael
* 'lcs_fixes' of github.com:raphaelsc/scylla:
  lcs: fix starvation at higher levels
  lcs: fix broken token range distribution at higher levels
2016-10-01 21:34:21 +03:00
Takuya ASADA
9639cc840e dist/redhat: add missing build time dependency for libunwind
There was missing dependency for libunwind, so add it.
Fixes #1722

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475260099-25881-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:39 +03:00
Takuya ASADA
c89d9599b1 dist/ubuntu: add missing build time dependency for libunwind
There was missing dependency for libunwind, so add it.
Fixes #1721

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475255706-26434-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:21 +03:00
Raphael S. Carvalho
a8ab4b8f37 lcs: fix starvation at higher levels
When max sstable size is increased, higher levels are suffering from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)

Fixes #1720.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:49 -03:00
Raphael S. Carvalho
a3bf7558f2 lcs: fix broken token range distribution at higher levels
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing sstable with lowest first key, when compacting
a level > 0. This resulted in performance problem because L1->L2 may have a
huge overlap over time, for example.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.

Fixes #1719.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:16 -03:00
Paweł Dziepak
eb1fcf3ecc query_pagers: fix clustering key range calculation
Paging code assumes that clustering row range [a, a] contains only one
row which may not be true. Another problem is that it tries to use
range<> interface for dealing with clustering key ranges which doesn't
work because of the lack of correct comparator.

Refs #1446.
Fixes #1684.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
2016-09-30 17:32:59 +02:00
Tomasz Grabiec
7e25b958ac transport: Extend request memory footprint accounting to also cover execution
CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for
request's memory only around reading of its frame from the connection
and not actual request execution. As a result too many requests may be
allowed to execute and we may run out of memory.

Fixes #1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
2016-09-30 14:23:14 +01:00
Duarte Nunes
72af476397 cql_query_test: Add test case for min/max token bounds
This patch adds a test case for specifying the minimum and maximum
tokens in a cql3 query.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:45:45 +00:00
Duarte Nunes
98b4814894 token_restriction: Deal with minimum tokens
This patch fixes a bug where queries such as the following are not
handled properly:

"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"

Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.

Ref #1139
Ref #693
Fixes #1717

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:08 +00:00
Duarte Nunes
862f51cddf partitioner: Parse token from bytes
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is parse a particular token and explicitly handle the
case when the minimum token is specified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:02 +00:00
Duarte Nunes
0c8f280af7 partition_key_view: Implement operator<<
The operator is declared, but it isn't implemented. This patch fixes
that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475225647-3800-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:54 +02:00
Duarte Nunes
a36888f3cb storage_service: Convert token through partitioner
This patch ensures we use the partitioner to convert a token to
sstring instead of casting.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475179683-28552-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:26 +02:00
Tomasz Grabiec
91b1bada55 Merge seastar upstream
* seastar 5b7252d...9e1d5db (5):
  > prometheus: prevent illegal prometheus names
  > scollectd: raw_to_value should not use network order
  > semaphore: Introduce get_units()
  > core::scollectd: truncate the identifiers fields on a 63 characters boundary
  > Merge "Fix ASAN errors in debug builds" from Tomasz
2016-09-29 13:23:24 +02:00
Asias He
511f8aeb91 gossip: Do not remove failure_detector history on remove_endpoint
Otherwise a node could wrongly think the decommissioned node is still
alive and not evict it from the gossip membership.

Backport: CASSANDRA-10371

7877d6f Don't remove FailureDetector history on removeEndpoint

Fixes #1714
Message-Id: <f7f6f1eec2aab1b97a2e568acfd756cca7fc463a.1475112303.git.asias@scylladb.com>
2016-09-29 13:00:47 +03:00
Asias He
a6d6341627 streaming: Add total_{incoming,outgoing}_bytes collectd metrics
It reflects number of bytes sent or received per second in streaming.

To use it:

$ tools/scyllatop/scyllatop.py "*streaming*"

Refs #1655
Message-Id: <5f7943cb2b459db5ed4bd8d7365532ea201ad2d9.1475116963.git.asias@scylladb.com>
2016-09-29 11:54:32 +02:00
Asias He
a6529ad582 repair: Fix split_and_add
Before: the range is split only once, so it is split into 2 sub ranges

  INFO  2016-09-29 15:52:43,625 [shard 0] repair - target_partitions=100, estimated_partitions=537, ranges.size=2,
  range=(8993553141924659802, 8997061146192366917] ->
  ranges={
  (8993553141924659802, 8995307144058513359], (8995307144058513359, 8997061146192366917]}

After: the range is split mulitple times, resulting 16 sub ranges.

  INFO  2016-09-29 15:55:07,934 [shard 0] repair - target_partitions=100, estimated_partitions=67, ranges.size=16,
  range=(8993553141924659802, 8997061146192366917] ->
  ranges={
  (8993553141924659802, 8993772392191391496], (8993772392191391496, 8993991642458123191],
  (8993991642458123191, 8994210892724854885], (8994210892724854885, 8994430142991586580],
  (8994430142991586580, 8994649393258318274], (8994649393258318274, 8994868643525049969],
  (8994868643525049969, 8995087893791781664], (8995087893791781664, 8995307144058513359],
  (8995307144058513359, 8995526394325245053], (8995526394325245053, 8995745644591976748],
  (8995745644591976748, 8995964894858708443], (8995964894858708443, 8996184145125440138],
  (8996184145125440138, 8996403395392171832], (8996403395392171832, 8996622645658903527],
  (8996622645658903527, 8996841895925635222], (8996841895925635222, 8997061146192366917]}

Without this patch, repair can do checksum with a range with a lot of
partitions, not the expected less than 100 partitions per checksum. This
can lead to unncessary data transfer since the checksum is too coarse.
For instacne, as above, if the checksum of 1 out of 537 partitions is
different, the whole 527 partitions will be synced.

Fixes #1613
Message-Id: <0775c20c485c105df5f10bd685048227f074c365.1475137029.git.asias@scylladb.com>
2016-09-29 10:09:25 +01:00
Pekka Enberg
20dccb4bf7 transport/server: Fix CQL Snappy compression failure
The snappy_compress() function expects the "compressed_length" parameter
to contain the actual output buffer length but now we're passing random
garbage from the stack.

Fixes #1711
Message-Id: <1475132127-316-1-git-send-email-penberg@scylladb.com>
2016-09-29 09:29:51 +01:00
Asias He
774d16306f gossip: Use lowres_clock for scheduled_gossip_task
The timer is fired once per second. Using low resolution clock is enough.
Message-Id: <1f21514e975afea6ac5c9dde18a881a41561da70.1475130948.git.asias@scylladb.com>
2016-09-29 10:03:14 +03:00
Piotr Jastrzebski
1948ec8061 Update README.md
Add --init to git submodules update.
It's needed for fmt.
Add libunwind-devel dependency do dnf install.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4918237f91d985649c195035c02b2dd9e9a1ff68.1475087373.git.piotr@scylladb.com>
2016-09-29 10:02:34 +03:00
Gleb Natapov
32989d1e66 Merge seastar upstream
* seastar 2b55789...5b7252d (3):
  > Merge "rpc: serialize large messages into fragmented memory" from Gleb
  > Merge "Print backtrace on SIGSEGV and SIGABRT" from Tomasz
  > test_runner: avoid nested optionals

Includes patch from Gleb to adapt to seastar changes.
2016-09-28 17:34:16 +03:00
Pekka Enberg
9ea24c9d2b Merge "repair: less stream_plan and less streaming traffic" from Asias
"This series improves repair by

 1) using less streaming sessions

 2) reducing unnecessary streaming traffic

 3) fixing a hang during shutdown

 See commit log for "repair: Reduce stream_plan usage", "repair: Reduce
 unnecessary streaming traffic" and "streaming: Fail streaming sessions
 during shutdown" for details.

 Tested with repair_additional_test.py."
2016-09-28 09:54:15 +03:00
Glauber Costa
f5fd6bd714 LSA: export information about size of the throttle queue
Also add information about for how long has the oldest been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.

This counter is named for consistency after similar counters from transport/.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
Glauber Costa
aa6a96d09b database: export virtual dirty bytes region group
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".

We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
Gleb Natapov
c95df8f053 messaging_service: use correct value for listen_to_bc_address is a constructor used by tests
Also make sure to not listen on the same exact address twice in case
listen_address == broadcast_address. Scylla configuration code does not
allow such thing to be configured, but better to be safe.

Message-Id: <20160927102316.GO32178@scylladb.com>
2016-09-27 11:27:23 +01:00
Pekka Enberg
e35166af10 Merge "gossip: Fix expire_time for gossip membership removal" from Asias
"We currently use steady_lock which is not consistent on nodes in the cluster.
 Use system_clock for it.

 Fixes #1704"
2016-09-27 11:46:09 +03:00
Asias He
1292341d77 gossip: Improve the expire time logging
Print when the node will be removed from gossip membership, e.g.,

INFO  2016-09-27 08:54:49,262 [shard 0] gossip - Node 127.0.0.3 will be
removed from gossip at [2016-09-30 08:54:48]: (expire = 1475196888294489339,
now = 1474937689262295270, diff = 259199 seconds)
2016-09-27 16:42:35 +08:00
Asias He
f0d3084c8b gossip: Switch to use system_clock
The expire time which is used to decide when to remove a node from
gossip membership is gossiped around the cluster. We switched to steady
clock in the past. In order to have a consistent time_point in all the
nodes in the cluster, we have to use wall clock. Switch to use
system_clock for gossip.

Fixes #1704
2016-09-27 16:42:13 +08:00
Avi Kivity
bfa9aa5d23 Merge "Installing the node_exporter" from Amnon
"The prometheus project and its sub project does not have RPM/DEB packaging yet,
but it does have binaries for download.

This series adds an installation script that download install and run as a
service the node_exporter. For os that uses systemd it has a spec file ready
that will be package with the system. For ubuntu a service file will be created
when running the installer.

After this series running node_exporter_install a node_exporter will be running
as a service on the machine."
2016-09-27 11:00:00 +03:00
Asias He
802c25e67b repair: Switch to use make_streaming_reader in checksum calculation
In patch ac619820 (streaming: Switch to use make_streaming_reade), we
switched to use make_streaming_reader for streaming. In repair, the
checksum phases also uses a mutation reader. For the same reasons (no
pollution to row cache, bounded new data after the reader is created),
switch repair checksum calculation to use the make_streaming_reader too.

Fixes #382
Fixes #1682

Message-Id: <9e0ecda861bb0b6f690da5e2378b208159ffa41c.1474933195.git.asias@scylladb.com>
2016-09-27 10:58:31 +03:00
Tomasz Grabiec
c03568d687 Merge tag 'asias/read_data_from_sstable_in_streaming/v2' from seastar-dev.git
From Asias:

With this series, streaming and repair are improved:

    - streaming, repair will not pollute the row cache on the sender side
      any more. Currently, we are risking evicting all the frequently-queried
      partitions from the cache when an operation like repair reads entire
      sstables and floods the row cache with swathes of cold data from they
      read from disk.

    - less data will be sent becasue the reader will only return existing
      data before the point of the reader is created, plus bounded amount
      of writes which arrive later. This helps reducing the streaming time
      in the case new data is being inserted all the time while streaming is
      in progress. E.g., adding a new node while there is a lot of cql write
      workload.

Fixes #382 and #1682
2016-09-26 11:30:12 +02:00
Asias He
ac6198208b streaming: Switch to use make_streaming_reader
Using make_streaming_reader for streaming on the sender side, it has
the following advantages:

- streaming, repair will not pollute the row cache on the sender side
  any more. Currently, we are risking evicting all the frequently-queried
  partitions from the cache when an operation like repair reads entire
  sstables and floods the row cache with swathes of cold data from they
  read from disk.

- less data will be sent becasue the reader will only return existing
  data before the point of the reader is created, plus bounded amount
  of writes which arrive later. This helps reducing the streaming time
  in the case new data is being inserted all the time while streaming is
  in progress. E.g., adding a new node while there is a lot of cql write
  workload.

Fixes #382
Fixes #1682
2016-09-26 16:12:56 +08:00
Asias He
b505e34062 database: Introduce make_streaming_reader
The make_streaming_reader returns a combined mutation reader reads
mutations from sstables and memtable. The memtable reader handles
memtable flushing automatically so no special handling is needed here.

It will be used by streaming soon.
2016-09-26 16:02:48 +08:00
Asias He
e5a5a9ba15 repair: Rename sync_ranges to request_transfer_ranges
To refelct the fact that the function does not sync the ranges but add
the ranges to request from peer or transfer to peer.
2016-09-26 16:00:07 +08:00
Takuya ASADA
d38aa6570f dist/common/scripts/scylla_setup: do not ask to select disks when there's no free disk
When there's no free disk, it asks to select disks from empty list:

"Please select disks from following list:
type 'done' to finish selection. selected:"

We should avoid to ask it, abort RAID setup instead.

Fixes #1673

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474429218-28382-1-git-send-email-syuu@scylladb.com>
2016-09-26 08:53:54 +03:00
Gleb Natapov
26ae8e8365 implement listen_on_broadcast_address option
When using multiple physical network interfaces, set this to true to
listen on broadcast_address in addition to the listen_address, allowing
nodes to communicate in both interfaces.  Ignore this property if the
network configuration automatically routes between the public and
private networks such as EC2.

Message-Id: <20160921094810.GA28654@scylladb.com>
2016-09-26 08:49:54 +03:00
Asias He
f377a3b7ac streaming: Fail streaming sessions during shutdown
Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test

The test does:

- Insert data on node1 only
- Insert data on node2 only
- Run repair on node1 and stop node1
  once "starting user-requested repair" is seen

The repair shutdown code may wait for the stream session to complete for
a very long time if node 1 finishes sending data to node2 and is waiting
for node2 to send data to it, when node1 is stopped. The stream session
will not be closed in this case until stream session _keep_alive_timeout
(10 minutes) expires. Instead of waiting for the stream_session keep
alive timer to expire, we can fail all the stream sessions during
shutdown.

Before 1 - The bad case (repair shutdown will last for 10 minutes):

  INFO  2016-09-21 16:23:56,617 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Executing streaming plan for repair-in
  INFO  2016-09-21 16:23:56,617 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Starting streaming to 127.0.0.2
  INFO  2016-09-21 16:23:56,617 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Beginning stream session with 127.0.0.2
  INFO  2016-09-21 16:23:56,618 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Prepare completed with 127.0.0.2. Receiving 1, sending 0
  INFO  2016-09-21 16:23:58,625 [shard 0] storage_service - Stop transport: stop_gossiping done
  INFO  2016-09-21 16:23:58,625 [shard 0] storage_service - Thrift server stopped
  INFO  2016-09-21 16:23:58,625 [shard 0] storage_service - CQL server stopped
  INFO  2016-09-21 16:23:58,625 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
  INFO  2016-09-21 16:23:58,626 [shard 0] storage_service - messaging_service stopped
  INFO  2016-09-21 16:23:58,626 [shard 0] storage_service - Stop transport: shutdown messaging_service done
  INFO  2016-09-21 16:23:58,626 [shard 0] storage_service - Stop transport: auth shutdown
  INFO  2016-09-21 16:23:58,626 [shard 0] storage_service - Stop transport: done
  INFO  2016-09-21 16:23:58,626 [shard 0] storage_service - Drain on shutdown: stop_transport done
  INFO  2016-09-21 16:23:58,626 [shard 0] tracing - Asked to shut down
  INFO  2016-09-21 16:23:58,626 [shard 0] tracing - Tracing is down
  INFO  2016-09-21 16:23:58,626 [shard 1] tracing - Asked to shut down
  INFO  2016-09-21 16:23:58,626 [shard 1] tracing - Tracing is down
  INFO  2016-09-21 16:23:58,626 [shard 0] storage_service - Drain on shutdown: tracing is stopped
  INFO  2016-09-21 16:23:58,669 [shard 0] storage_service - Drain on shutdown: flush column_families done
  INFO  2016-09-21 16:23:58,669 [shard 0] storage_service - Drain on shutdown: shutdown commitlog done
  INFO  2016-09-21 16:23:58,669 [shard 0] storage_service - Drain on shutdown: done
  INFO  2016-09-21 16:23:58,669 [shard 0] repair - Starting shutdown of repair
  INFO  2016-09-21 16:25:56,624 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] The session 0x600021516c00 made no progress with peer 127.0.0.2

Before 2 - The good case:

  INFO  2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Executing streaming plan for repair-in
  INFO  2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Starting streaming to 127.0.0.2
  INFO  2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Beginning stream session with 127.0.0.2
  INFO  2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Prepare completed with 127.0.0.2. Receiving 1, sending 0
  INFO  2016-09-21 16:18:34,098 [shard 0] storage_service - Stop transport: stop_gossiping done
  INFO  2016-09-21 16:18:34,098 [shard 0] storage_service - Thrift server stopped
  INFO  2016-09-21 16:18:34,098 [shard 0] storage_service - CQL server stopped
  INFO  2016-09-21 16:18:34,098 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
  INFO  2016-09-21 16:18:34,155 [shard 0] messaging_service - Retry verb=19 to 127.0.0.2:0, retry=10: rpc::closed_error (connection is closed)
  WARN  2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] COMPLETE_MESSAGE for 127.0.0.2 has failed: rpc::closed_error (connection is closed)
  WARN  2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Streaming error occurred
  INFO  2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Session with 127.0.0.2 is complete, state=FAILED
  INFO  2016-09-21 16:18:34,155 [shard 0] storage_service - messaging_service stopped
  INFO  2016-09-21 16:18:34,155 [shard 0] storage_service - Stop transport: shutdown messaging_service done
  INFO  2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] bytes_sent = 0, bytes_received = 245000
  WARN  2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Stream failed, peers={127.0.0.2}
  WARN  2016-09-21 16:18:34,155 [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
  INFO  2016-09-21 16:18:34,155 [shard 0] repair - repair 1 failed - streaming::stream_exception (Stream failed)
  INFO  2016-09-21 16:18:34,155 [shard 0] storage_service - Stop transport: auth shutdown
  INFO  2016-09-21 16:18:34,155 [shard 0] storage_service - Stop transport: done
  INFO  2016-09-21 16:18:34,155 [shard 0] storage_service - Drain on shutdown: stop_transport done
  INFO  2016-09-21 16:18:34,155 [shard 0] tracing - Asked to shut down
  INFO  2016-09-21 16:18:34,155 [shard 0] tracing - Tracing is down
  INFO  2016-09-21 16:18:34,156 [shard 1] tracing - Asked to shut down
  INFO  2016-09-21 16:18:34,156 [shard 1] tracing - Tracing is down
  INFO  2016-09-21 16:18:34,156 [shard 0] storage_service - Drain on shutdown: tracing is stopped
  INFO  2016-09-21 16:18:34,199 [shard 0] storage_service - Drain on shutdown: flush column_families done
  INFO  2016-09-21 16:18:34,199 [shard 0] storage_service - Drain on shutdown: shutdown commitlog done
  INFO  2016-09-21 16:18:34,199 [shard 0] storage_service - Drain on shutdown: done
  INFO  2016-09-21 16:18:34,199 [shard 0] repair - Starting shutdown of repair
  INFO  2016-09-21 16:18:34,199 [shard 0] repair - Completed shutdown of repair
  INFO  2016-09-21 16:18:34,199 [shard 0] compaction_manager - Asked to stop
  INFO  2016-09-21 16:18:34,199 [shard 1] compaction_manager - Asked to stop

After:

  INFO  2016-09-21 16:06:21,684 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Executing streaming plan for repair-in
  INFO  2016-09-21 16:06:21,684 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Starting streaming to 127.0.0.2
  INFO  2016-09-21 16:06:21,684 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Beginning stream session with 127.0.0.2
  INFO  2016-09-21 16:06:21,685 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Prepare completed with 127.0.0.2. Receiving 1, sending 0
  INFO  2016-09-21 16:06:23,687 [shard 0] storage_service - Stop transport: stop_gossiping done
  INFO  2016-09-21 16:06:23,687 [shard 0] storage_service - Thrift server stopped
  INFO  2016-09-21 16:06:23,687 [shard 0] storage_service - CQL server stopped
  INFO  2016-09-21 16:06:23,687 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - messaging_service stopped
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: shutdown messaging_service done
  INFO  2016-09-21 16:06:23,688 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Session with 127.0.0.2 is complete, state=FAILED
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - stream_manager stopped
  INFO  2016-09-21 16:06:23,688 [shard 1] storage_service - stream_manager stopped
  INFO  2016-09-21 16:06:23,688 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] bytes_sent = 0, bytes_received = 25725
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: shutdown stream_manager done
  WARN  2016-09-21 16:06:23,688 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Stream failed, peers={127.0.0.2}
  WARN  2016-09-21 16:06:23,688 [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
  INFO  2016-09-21 16:06:23,688 [shard 0] repair - repair 1 failed - streaming::stream_exception (Stream failed)
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: auth shutdown
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: done
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - Drain on shutdown: stop_transport done
  INFO  2016-09-21 16:06:23,688 [shard 0] tracing - Asked to shut down
  INFO  2016-09-21 16:06:23,688 [shard 0] tracing - Tracing is down
  INFO  2016-09-21 16:06:23,688 [shard 1] tracing - Asked to shut down
  INFO  2016-09-21 16:06:23,688 [shard 1] tracing - Tracing is down
  INFO  2016-09-21 16:06:23,688 [shard 0] storage_service - Drain on shutdown: tracing is stopped
  INFO  2016-09-21 16:06:23,774 [shard 0] storage_service - Drain on shutdown: flush column_families done
  INFO  2016-09-21 16:06:23,774 [shard 0] storage_service - Drain on shutdown: shutdown commitlog done
  INFO  2016-09-21 16:06:23,774 [shard 0] storage_service - Drain on shutdown: done
  INFO  2016-09-21 16:06:23,774 [shard 0] repair - Starting shutdown of repair
  INFO  2016-09-21 16:06:23,774 [shard 0] repair - Completed shutdown of repair
  INFO  2016-09-21 16:06:23,774 [shard 0] compaction_manager - Asked to stop
  INFO  2016-09-21 16:06:23,774 [shard 1] compaction_manager - Asked to stop
2016-09-26 06:29:40 +08:00
Asias He
7c873f0d1f repair: Reduce unnecessary streaming traffic
If the remote peers have the same checksum, we can only fetch from
one of the peer node instead of all of them since they all have the same
data anyway. No need to fetch from all of them.

In addition to above optimization, if the local peer has no data, we can
skip sending the data back to the remote peer. Due to the fact that all
the remote peers have the same checksum and local peer has no data, so
each and every remote peer has all the data. There is no need to merge
the remote data with local data and send back the merged data back to
remote peers.

Refs: #1617
2016-09-26 06:28:51 +08:00
Asias He
99e77e8ec2 repair: Do not abort the repair when one range is failed
failed_ranges is added to track the ranges that fail during repair.
2016-09-26 06:28:51 +08:00
Asias He
81c98ff3d9 repair: Reduce stream_plan usage
Right now, we are using one stream_plan for each range of a column
family. This generates tons of stream_plans and stream_sessions. Each
stream_plan can transfer multiple ranges and column families. We can
use a single stream_plan to stream datas for multiple ranges and column
families, so that 1) overhead of stream_plan/session negotiation is
reduced 2) it is much easier to debug/monitor few stream_sessions

Fixes #1685
2016-09-26 06:28:50 +08:00
Asias He
a0020fdad2 stream_session: Allow adding ranges to a cf more than once
Append the ranges to a stream_transfer_task if the cf is already added to
_transfers in add_transfer_ranges.
2016-09-26 06:28:50 +08:00
Asias He
576e15532f streaming: Add append_ranges for stream_transfer_task
Allow to append more ranges to transfer for a stream transfer task.
2016-09-26 06:28:50 +08:00
Avi Kivity
3057ca05bc Merge "Improve loggging when nodes are decommissioned" from Asias
"When a node is decommissioned, its gossip state will not be removed from gossip
immediately. It will only be removed 3 days later which helps nodes that were
down when the node was decommissioned to know decommission later when they are
up again.

This series improves the logging to reduce confusion when a node tries to
talking to a decommissioned node. In addition, we now do not try to talk to the
decommissioned in the unreachable_endpoints gossip round.

Fixes #1615"

* tag 'asias/loggging_decommissioned_nodes/v1' of github.com:cloudius-systems/seastar-dev:
  gossip: Make two log items debug level
  gossip: Print node status when node is UP or DOWN
  gossip: Ignore the node which is decommissioned in gossip round
  gossip: Print convict debug info only when the node is alive
  gossip: Add more timing log in add_expire_time_for_endpoint
  streaming: Print on_remove and on_restart log when peer exists
  streaming: Introduce has_peer in stream_manager
2016-09-25 15:19:13 +03:00
Asias He
830f4ee353 gossip: Make two log items debug level
It is duplciated with "InetAddresss x.x.x.x is now UP" message.

INFO  2016-09-23 10:35:15,512 [shard 0] gossip - Node 127.0.0.1 has restarted, now UP, status = NORMAL
INFO  2016-09-23 10:35:15,513 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL

Make the log a bit cleaner.
2016-09-25 07:17:19 +08:00
Asias He
a26a26963c gossip: Print node status when node is UP or DOWN
For example:

gossip - InetAddress 127.0.0.4 is now UP, status = NORMAL
gossip - InetAddress 127.0.0.3 is now DOWN, status = LEFT
gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
2016-09-25 07:17:19 +08:00
Asias He
1d9401d080 gossip: Ignore the node which is decommissioned in gossip round
If the node is decommissioned, there is no point to try to contact it
again in the gossip round.
2016-09-25 07:17:19 +08:00
Asias He
4b73443222 gossip: Print convict debug info only when the node is alive 2016-09-25 07:17:19 +08:00
Asias He
99a2ae0fb5 gossip: Add more timing log in add_expire_time_for_endpoint
It tells when the node is expected to expire and how many seconds are
left.
2016-09-25 07:17:19 +08:00
Asias He
40f7a355a0 streaming: Print on_remove and on_restart log when peer exists
We print the following messages even if there is no stream_session with
that peer. It is a bit confusing.

  INFO  2016-09-23 08:26:37,254 [shard 0] stream_session - stream_manager:
  Close all stream_session with peer = 127.0.0.1 in on_restart

  INFO  2016-09-23 08:26:37,287 [shard 0] stream_session - stream_manager:
  Close all stream_session with peer = 127.0.0.3 in on_remove

Print only when the streaming session with the peer exists.
2016-09-25 07:17:19 +08:00
Asias He
2ac4ce77a9 streaming: Introduce has_peer in stream_manager
It is used to query if a streaming peer with inet_address exists.
2016-09-25 07:17:13 +08:00
Nadav Har'El
fe1ba753ce Avoid semaphore's default initial value
The fact that Seastar's semaphore has a default initializer of 1 if not
explicitly initialized is confusing and unexpected and recently lead to
two bugs. So ScyllaDB should not rely on this default behavior, and specify
the initial value of each semaphore explicitly.

In several cases in the ScyllaDB code, the explict initialization was
missing, and this patch adds it. In one case (rate_limiter) I even think
the default of 1 was a bit strange, and 0 makes more sense.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1474530745-23951-1-git-send-email-nyh@scylladb.com>
2016-09-24 19:25:02 +03:00
Paweł Dziepak
eb59b4c4ab keys: disable constructing from generic range
stdx::optional<T> uses quite elaborate std::enable_if_t magic to decide
whether the argument passed to its constructor should be used for a call
T constructor or stdx::optional<T> constructor.

Apparently, with GCC 6.2 having T constructor which accepts any type
confuses that magic and we end up with compile errors.

The solution is to have from_range() method that replaces that
constructor from range. There is also constructor that creates a key
from std::vector<bytes> so that code generated by IDL works as it did
before.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1474550971-15309-1-git-send-email-pdziepak@scylladb.com>
2016-09-24 18:57:01 +03:00
Raphael S. Carvalho
cfe7419f0f sstables: update or remove some outdated comments
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <74bae503447da2544a005e29b7d3aafa9f6e8c90.1474383273.git.raphaelsc@scylladb.com>
2016-09-24 18:53:19 +03:00
Raphael S. Carvalho
0f1bd3c527 db: fix clustering key filter
When date tiered strategy is enabled, filter_sstable_for_reader()
was returning more sstables than needed because the return type
of serialized_tri_compare::operator() was wrong, which results
in bad performance.

tgrabiec: Refs #1449

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0301d7588e33c7bbb8cd80fed20a1827926a8fff.1474585088.git.raphaelsc@scylladb.com>
2016-09-23 12:33:58 +02:00
Asias He
e352570f52 conf: Move initial_token to supported section in scylla.yaml
initial_token is actually supported

Fixes #1686
Message-Id: <465da088696f72a3a7bcf19ba8e4895a0a648e7c.1474512235.git.asias@scylladb.com>
2016-09-23 09:34:05 +03:00
Tomasz Grabiec
0b0d126721 Merge seastar upstream
Fixes #1622.
Fixes #1690.

* seastar 40a68fa...2b55789 (5):
  > input_stream: Fix possible infinite recursion in consume()
  > iostream: Fix stack overflow in output_stream::split_and_put()
  > condition_variable: fix spurious wakeup
  > Merge "assorted rpc fixes" from Gleb
  > Merge "Simple fixes for doxygen" from Glauber
2016-09-22 14:27:52 +02:00
Amnon Heiman
a6749116a7 scylla_setup: Install node_exporter
This adds the option to install node_exporter during setup.

The node_exporter export server information in the prometheus API.
It should be used when using the scylla prometheus API to get the server
information.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-09-22 09:43:53 +03:00
Amnon Heiman
4e0dcb59e7 scylla.spec: package the node_exporter scripts
This patch adds the node_exporter related files to the rpm.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-09-22 09:43:53 +03:00
Amnon Heiman
3d242fdb4d Add a link to node_exporter_install
This adds a link to node_exporter_install in sbin, so it will be
availabe in the path.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-09-22 09:43:53 +03:00
Amnon Heiman
9d3edd3a28 service file for node_exporter with systemd
This patch adds a service file for OS that supports systemd.

When started, it would run an already installed node_exporter or fail.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-09-22 09:43:53 +03:00
Amnon Heiman
801b2c4914 An installation script for node_exporter
node_exporter is a utility that export node information via prometheus
API. It takes care of host related metrics such as CPU and memory.

The install script, download the node_exporter binaries, create a link
in /usr/bin.

On OS with systemd supported it would enable and start the installed
service file to start as a service. On others (ubuntu) it would create a conf file and start it.

The installation should be done using sudo.

After a successful installation, the node_exporter would run as a
service.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-09-22 09:33:03 +03:00
Paweł Dziepak
906250dcbd Merge "Enhance GDB script with new LSA-related commands" from Tomek 2016-09-21 13:22:00 +01:00
Raphael S. Carvalho
67343798cf api: implement api to return sstable count per level
'nodetool cfstats' wasn't showing per-level sstable count because
the API wasn't implemented.

Fixes #1119.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0dcdf9196eaec1692003fcc8ef18c77d0834b2c6.1474410770.git.raphaelsc@scylladb.com>
2016-09-21 09:13:40 +03:00
Asias He
aa47265381 gossip: Fix std::out_of_range in setup_collectd
It is possible that endpoint_state_map does not contain the entry for
the node itself when collectd accesses it.

Fixes the issue:

Sep 18 11:33:16 XXX scylla[19483]: [shard 0] seastar - Exceptional
future ignored: std::out_of_range (_Map_base::at)

Fixes #1656

Message-Id: <8ffe22a542ff71e8c121b06ad62f94db54cc388f.1474377722.git.asias@scylladb.com>
2016-09-20 19:38:16 +03:00
Tomasz Grabiec
69aec3835f scylla-gdb: Enhance 'scylla ptr' to show if object is managed by LSA
Example:

  (gdb) scylla ptr 0x601000480000
  thread 1, large, LSA-managed

One can then use 'scylla lsa-segment 0x601000480000' to examine LSA
segment contents.
2016-09-20 16:53:23 +02:00
Tomasz Grabiec
486a92092b scylla-gdb: Add 'scylla segment-descs' command
Displays information about shard's segment descriptors. One can see
which segments belong to LSA, what's their occupancy, etc.

(gdb) scylla segment-descs
...
0x601000940000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420
0x601000980000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420
0x6010009c0000: lsa free=261153 region=0x60100036fcf0 zone=0x6010000fb420
0x601000a00000: std
0x601000a40000: lsa free=25508 region=0x60100036d890 zone=0x6010000fb420
0x601000a80000: std
0x601000ac0000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420
0x601000b00000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420
0x601000b40000: std
...
2016-09-20 16:53:23 +02:00
Tomasz Grabiec
b0b28696b5 scylla-gdb: Add 'scylla lsa-segment' command
Allows one to examine contents of LSA segment.

Example:

  (gdb) scylla lsa-segment 0x601000480000
  0x601000480e70: live size=144 migrator=standard_migrator<cache_entry>::object
  0x601000480f10: live size=144 migrator=standard_migrator<cache_entry>::object
  0x601000480fb0: free size=192
  0x60100048107e: free size=42
  0x6010004814e0: free size=192
  0x6010004815ae: free size=40
  0x6010004815e8: free size=192
  0x6010004816b8: live size=144 migrator=standard_migrator<cache_entry>::object
  0x601000481758: free size=192
  ...
2016-09-20 16:53:21 +02:00
Tomasz Grabiec
5011b77e15 scylla-gdb: Add std::vector wrapper
Makes vector values itearable from python level.
2016-09-20 16:53:20 +02:00
Pekka Enberg
42dd4670dc transport/server: Add CQL frame Snappy compression support
Fixes #1286
Message-Id: <1474370861-5928-1-git-send-email-penberg@scylladb.com>
2016-09-20 12:33:36 +01:00
Pekka Enberg
acc93509a2 transport/server: Fix CQL connection compression negotiation
Benoît Canet points out that CQL messages are not always compressed
although compression is enabled by the driver. Turns out our CQL
compression negotiation is broken. We need to negotiate compression upon
STARTUP message and not rely on the incoming request to have the
compression bit enabled.

Fixes #1680
Message-Id: <1474366693-3001-1-git-send-email-penberg@scylladb.com>
2016-09-20 11:19:27 +01:00
Pekka Enberg
f92bbc6f44 cql3: Kill unimplemented query_options constructor
The constructor was added in commit 7f3ce39 ("query_options: Add
constructor for batch mode options (multi-level)") but apparently it was
never actually implemented.

Spotted by CLion.
Message-Id: <1474303017-23383-1-git-send-email-penberg@scylladb.com>
2016-09-20 10:01:10 +01:00
Pekka Enberg
f1d0401ed2 main: Use proper logger for API server messages
We have a "startlog" that we can use to print out API server messages.

Message-Id: <1474358312-26510-1-git-send-email-penberg@scylladb.com>
2016-09-20 11:09:59 +03:00
Pekka Enberg
38b137713f transport/server: Fix CQL v1 prepared statement execution
The EXECUTE message encoding is different between CQL binary protocol
versions v1 and v2 (and later). Fix process_execute() to deserialize the
message as per the CQL binary protocol v1 specification:

    Executes a prepared query. The body of the message must be:
      <id><n><value_1>....<value_n><consistency>
    where:
      - <id> is the prepared query ID. It's the [short bytes] returned as a
        response to a PREPARE message.
      - <n> is a [short] indicating the number of following values.
      - <value_1>...<value_n> are the [bytes] to use for bound variables in the
        prepared query.
      - <consistency> is the [consistency] level for the operation.

Fixes #1676

Message-Id: <1474287392-16792-1-git-send-email-penberg@scylladb.com>
2016-09-19 15:26:30 +03:00
Raphael S. Carvalho
0eaa0f46c9 sstables: store first and last decorated keys in sstable object
leveled strategy uses heavily first and last decorated keys of a
sstable to get overlapping sstables in a given level. By storing
first and last decorated keys in sstable object, it's expected
that performance of leveled strategy (not compaction) will be
improved.
We will set first and last keys in sstable when either loading
or sealing it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0abca819454ab4c088541bb49714f1f6a7dc4f42.1473959677.git.raphaelsc@scylladb.com>
2016-09-19 13:25:58 +02:00
Raphael S. Carvalho
dffb41f9d8 sstables: remove schema parameter from some sstable methods
schema can now be found in the sstable object itself.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0fa44fedbe784d924522d7eeca77c16294479c6e.1473959677.git.raphaelsc@scylladb.com>
2016-09-19 13:25:58 +02:00
Tomasz Grabiec
2282599394 tests: Add test for UUID type ordering
Message-Id: <1473956716-5209-2-git-send-email-tgrabiec@scylladb.com>
2016-09-16 11:07:14 +01:00
Tomasz Grabiec
804fe50b7f types: fix uuid_type_impl::less
timeuuid_type_impl::compare_bytes is a "trichotomic" comparator (-1,
0, 1) while less() is a "less" comparator (false, true). The code
incorrectly returns c1 instead of c1 < 0 which breaks the ordering.

Fixes #1196.
Message-Id: <1473956716-5209-1-git-send-email-tgrabiec@scylladb.com>
2016-09-16 11:06:55 +01:00
Duarte Nunes
bc3cbb7009 thrift: Correctly detect clustering range wrap around
This patch uses the clustering bounds comparator to correctly detect
wrap around of a clustering range in the thrift handler.

Refs #1446

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1473938611-8590-1-git-send-email-duarte@scylladb.com>
2016-09-15 14:31:16 +01:00
Shlomi Livne
acb83073e2 ami: Fix instructions how to run scylla_io_setup on non ephemeral instances
On instances differenet then i2/m3/c3 we provide instructions to run
scylla_ip_setup. Running scylla_io_setup requires access to
/var/lib/scylla to crate a temporary file. To gain access to that
directory the user should run 'sudo scylla_io_setup'.

refs: #1645

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <4ce90ca1ba4da8f07cf8aa15e755675463a22933.1473935778.git.shlomi@scylladb.com>
2016-09-15 13:40:53 +03:00
Avi Kivity
d2b5f3ff44 Merge seastar upstream
* seastar e534401...40a68fa (1):
  > rpc: fix dangling reference in read_rcv_buf
2016-09-15 12:20:49 +03:00
Gleb Natapov
2e8b255741 Merge seastar upstream
* seastar 0303e0c...e534401 (6):
  > Merge "enable rpc to work on non contiguous memory for receive" from Gleb
  > install-dependencies.sh: install python3 for Ubuntu/Debian, which requires for configure.py
  > fix tcp stuck when output_stream write more than 212992 bytes once.
  > scripts/posix_net_conf.sh: supress 'ls: cannot access /sys/class/net/<NIC>/device/msi_irqs/' error message
  > scripts/posix_net_conf.sh: fix 'command not found' error when specifies --cpu-mask
  > native_network_stack: Fix use after free/missing wait in dhcp

Includes: "Remove utils::fragmented_input_stream and utils::input_stream in favor of seastar version" from Gleb.
2016-09-15 12:12:16 +03:00
Tomasz Grabiec
ed312c2b1a Merge remote-tracking branch 'duarte/comparator/v1'
From Duarte:

This patchset reuses the bound_view::comparator in range_tombstone to
correctly detect wrap around of a clustering range. This fixes a
manifestation of #1446 that results in wrong query results.

Introduced by b1f9688432

Fixes #1669
Refs #1446
2016-09-14 18:21:05 +02:00
Paweł Dziepak
bc2ff41003 cql3: fix units in large batch warning
When displaying a warning about batch being too large C* reports batch
size and limit in bytes while S* uses kB.

This patch switches Scylla to use bytes.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1473867171-18932-1-git-send-email-pdziepak@scylladb.com>
2016-09-14 18:38:46 +03:00
Takuya ASADA
647673195c dist/redhat/build_rpm.sh: add dependency for rpmbuild
Install rpmbuild when it's not installed yet.
Fixes #1651

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1473193430-14792-1-git-send-email-syuu@scylladb.com>
2016-09-14 14:57:55 +03:00
Calle Wilund
f126cf769a column_family: Ensure flush() waits for all previous flushes + self
Fixes #1577
Message-Id: <1472569952-4066-1-git-send-email-calle@scylladb.com>
2016-09-14 11:00:41 +01:00
Duarte Nunes
f864bca773 row_cache: Deal with side-effects in allocating_section
In row_cache::make_reader, we update statistics inside an
allocating_section, which retries the supplied function until it can
satisfy all allocations by way of reserving LSA memory up front. Since
those updates are interleave with allocations, retries can lead to
miscounts.

This patch fixes this by updating statistics after all allocations.

Fixes #1659

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1473845977-20205-1-git-send-email-duarte@scylladb.com>
2016-09-14 10:46:25 +01:00
Tomasz Grabiec
a498da1987 database: Ignore spaces in initial_token list
Currently we get boost::lexical_cast on startup if inital_token has a
list which contains spaces after commas, e.g.:

  initial_token: -1100081313741479381, -1104041856484663086, ...

Fixes #1664.
Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>
2016-09-14 11:58:13 +03:00
Paweł Dziepak
c220c676c8 types: honour end of sstring_view
There are several places in types.cc where we assume that sstring_view
range is null terminated. That may be not true and we should always use
either begin()/end() or data()/size() pairs.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-09-07 14:30:56 -07:00
Paweł Dziepak
6373289532 Merge "Adding slow query API" from Amnon
"This series adds an API for the slow query recording.

After this series it will be possible to set the/get the slow query
recording parameters."
2016-09-07 11:06:09 -07:00
Pekka Enberg
1095705a6b Update scylla-ami submodule
* dist/ami/files/scylla-ami 14c1666...e1e3919 (1):
  > scylla_ami_setup: remove scylla_cpuset_setup
2016-09-07 21:04:03 +03:00
Avi Kivity
7ac729b4d5 Merge "Optimize reads for clustered data" from Raphael
"This will be very important for read performance of time series use case,
where timestamp is usually stored as a clustering key, and the user asks
for specific data using a clustering range filter. Example:
    CREATE TABLE temperature (
        weatherstation_id text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY (weatherstation_id,event_time)
    );
    ...
    SELECT * FROM temperature
        WHERE weatherstation_id='1234ABCD'
        AND event_time > '2013-04-03 07:01:00'
        AND event_time < '2013-04-03 07:04:00';

This is based on: https://issues.apache.org/jira/browse/CASSANDRA-5514

To check correctness, I wrote a dtest that runs scylla with row cache disabled,
creates several sstables with non overlapping clustering key ranges, queries
data using several clustering range filters, and checks that the database
returns the expected results.

Tested performance with a tool I wrote myself [1] and performance is indeed
improved by this patchset. This tool works as follow:
Scylla is started with row cache disabled. That's wanted here because we're
measuring a specific code that only gets executed if row cache misses the data
we asked for. Then Scylla is populated node with N sstables ('nodetool flush'
is used to ensure it), where each will have M clustering keys, totaling N*M
clustering keys. Finally, we will start asking for data using a clustering
range filter. The tool measures throughput and min/max/avg latency.

[1]: https://gist.github.com/raphaelsc/4c415f592aaed14a18be31279d225972

Follow the results:

BEFORE
-----
('Clustering keys / second: ', 747.9672111659951)
('Max latency (ms): ', 33)
('Min latency (ms): ', 12)
('Avg latency (ms): ', 13.0)
The operation took 13.3695700169 seconds

AFTER
-----
('Clustering keys / second: ', 3159.115303945648)
('Max latency (ms): ', 22)
('Min latency (ms): ', 2)
('Avg latency (ms): ', 3.0)
The operation took 3.16544318199 seconds

NOTE: Throughput and average latency are improved by a factor of ~4.
-----"
2016-09-04 15:06:32 +03:00
Amnon Heiman
11c687dd93 API: Add slow query logging implementation
This adds the implementation for the slow query logging API.

After this patch the following will be available:

curl -X GET  "http://localhost:10000/storage_service/slow_query"
curl -X POST
"http://localhost:10000/storage_service/slow_query?enable=true&ttl=10&threshold=6000"

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-09-03 01:15:22 +03:00
Amnon Heiman
ed1d02b1a3 API: Add slow query API definition
This adds the GET and POST api for slow query logging.

The GET return an object with the enable, ttl and threshold and the POST
lets you configure each of them.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-09-03 01:15:15 +03:00
Raphael S. Carvalho
b9f67351da db: expose clustering filter info via collectd
That's needed to observe behavior of clustering filter, and to
check if it's worthwhile for a specific workload.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:32:23 -03:00
Raphael S. Carvalho
a2dc88889d db: enable clustering optimization only on dtcs
Leveled strategy will not benefit from this strategy because
there's only a few sstables that will contain a given partition
key, which means that a clustering key that belongs to a specific
partition key can only be in a few sstables as well.

Date tiered strategy is the one that will actually benefit the
most from this optimization. Size tiered may benefit from it too
if clustering key isn't overwritten, but it will not use the
clustering optimization.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:31:07 -03:00
Raphael S. Carvalho
8d03ccd604 sstables: optimize reads with clustering filter
If user specifies a clustering filter, it's possible to filter out
sstable based on its metadata that tracks min/max clustering value.

For example, if sstable stores clustering key from 'a' through 'c',
it's possible to filter out that sstable if user asks for data
with clustering key greater than 'c'.

That's done by comparing each component separately because
clustering key may be composite. Further information can be found
here: https://issues.apache.org/jira/browse/CASSANDRA-5514

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:51:50 -03:00
Raphael S. Carvalho
768aced741 partition_slice: introduce key-independent function to get ranges
That will be important for sstable code that will rule out a sstable
if it doesn't cover a given clustering key range.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:56 -03:00
Raphael S. Carvalho
dce61ddb02 types: introduce abstract_type::as_tri_comparator()
That's akin to abstract_type::as_less_comparator's nature.
So we don't have to repeat something like the following everywhere:
auto cmp = [&type] (const bytes_view& b1, const bytes_view& b2) {
	return type->compare(b1, b2); }

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:53 -03:00
Raphael S. Carvalho
004617839d database: check bloom filter of all sstables earlier
All sstables will now have bloom filter checked in a single pass
before reader iterate through all candidates. It's possible that
we will need to futurize the procedure if it holds cpu for too
long. This change is also a step towards the optimization that
will rule out sstables based on clustering filter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:08 -03:00
Raphael S. Carvalho
2a426ab248 tests: add test to check tombstone metadata
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:35 -03:00
Raphael S. Carvalho
94c8ef39c3 sstables: store components ranges in sstable object
Store range for each clustering component in sstable itself to
optimize sstable filtering based on clustering key.
If schema defines no clustering key, this new field will be
empty. Each range stores min and max value of that specific
component. With this information, it's possible to know if a
sstable possibly stores a given clustering component.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:32 -03:00
Raphael S. Carvalho
026853fabb tests: add test to check composite validity
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:30 -03:00
Raphael S. Carvalho
0a5af61176 sstables: introduce function to validate min max clustering values
Scylla was generating a sstable with incorrect min max clustering
values. This information is used to filter out a sstable when user
asks for a range of clustering rows. So it's important to detect
wrong metadata and make sure that it will not be used.
The validation is fast and will only happen when loading a sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:28 -03:00
Raphael S. Carvalho
1f31223f32 sstables: store schema in sstable object
That will be needed for optimization that will store decorated keys
in the sstable object, and also for a subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As schema is stored in sstable, there's no longer
a need to store ks and cf names in it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:17 -03:00
Avi Kivity
7a140a306e Revert "sstables: optimize selection of sstables for leveled strategy"
This reverts commit c75b07fc34f0e7267a8e49276b96bbd4686cb78d; does not
deduplicate the sstable list.
2016-09-01 18:34:08 +03:00
Raphael S. Carvalho
c75b07fc34 sstables: optimize selection of sstables for leveled strategy
It's possible to copy sstables directly into vector, and that will
improve performance. my benchmark tool[1] shows that new version
reduces running time of *copy procedure* by factor of two after
1024^2 calls.
Switching to back_inserter improves throughput even further.
[1]: gist.github.com/raphaelsc/a4b27290f362cdecdef399770dda759c

Refs #1632.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <7153514a9b5f5eb24dff518ee9fa3680e0881dae.1472741401.git.raphaelsc@scylladb.com>
2016-09-01 18:08:53 +03:00
Glauber Costa
dc5d8e33af Revert "row_cache: update sstable histograms on cache hits"
This reverts commit 1726b1d0cc.

Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.

Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.

The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format in the other side.

Conflicts:
	row_cache.hh

Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scy
lladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-01 18:07:31 +03:00
Avi Kivity
e33671c285 Merge "tracing: Trace read sstables" from Duarte
"This patchset traces sstables we read from. To do that, we
need to flow the trace_state_ptr to the mutation_readers."
2016-09-01 13:24:16 +03:00
Duarte Nunes
ba374da043 database: Trace sstable accesses
This patch traces when we read from an sstable, be it a key range or a
single one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
f4cf2f2aef tracing: Make trace_state_ptr argument required
This patch makes the optional trace_state_ptr arguments introduced in
previous patches mandatory where possible. Functions which are called
internally don't have a trace context, so for those we keep the
argument's default value for convenience.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
46b86ff801 storage_proxy: Pass along trace_state for queries
This patch changes the storage_proxy so it passed along a
trace_state_ptr to the layers below, when querying locally or
receiving a remote query request.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
030db65c62 database: Accept a trace_state_ptr
This patch changes the database and column_family types so a
trace_state_ptr can be passed in when querying. This enables tracing
of the inner components.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:28 +02:00
Duarte Nunes
9269256246 row_cache: Accept a trace_state_ptr
This patch changes the row_cache so it accepts a trace_state_ptr,
which it is responsible of flowing to the underlying mutation_reader
if needed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:00:55 +02:00
Duarte Nunes
5fd66f00c2 mutation_reader: Accept trace_state_ptr
This patch changes the mutation_reader so it optionally accepts a
trace_state_ptr. This will allow us to trace, for example, which
sstables are accessed during a request.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:00:31 +02:00
Avi Kivity
cc127295e9 Merge "Fill in information for sstables per read histogram" from Glauber
"Nodetool cfhistograms is supposed to tell us how many SSTables were touched per
read. Currently, we are a bit in the dark as we don't export that information.

This patch exports that, so that we can start using it."
2016-09-01 12:54:24 +03:00
Glauber Costa
1726b1d0cc row_cache: update sstable histograms on cache hits
If we have a cache hit, we still need to update our sstable histogram - notting
that we have touched 0 SSTables.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:22 -04:00
Glauber Costa
ce24fd05fe database: keep statistics on SSTables touched per read
That is done for single partition queries only - mimicking what
Cassandra does on that matter.

For this to be correct, we also need to update this histogram on cache
hits - in which case we update the read as having touched 0 SSTables. That
will be done on a separate patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:21 -04:00
Glauber Costa
0f413695ac database: make column family stats mutable
The make_reader method is currently a const method, but we would like to start
keeping hit statistics from it.

Instead of relaxing the const condition too much, we can just mark the _stats
field as mutable, indicating that make_reader will not be able to change
anything in the CF, except for keeping statistics.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:24 -04:00
Glauber Costa
5c4d73577a initialize sstables_per_read histogram with 35 instead of 90 buckets
This is to match what Cassandra does. Nodetool may be expecting this on the
other side.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:24 -04:00
Glauber Costa
4310635bae move estimated histogram to utils
Nothing sstable-specific in it, really.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:23 -04:00
Glauber Costa
ffc2131c51 decouple estimated_histogram from sstables
There is nothing really that fundamentally ties the estimated histogram to
sstables. This patch gets rid of the few incidental ties. They are:

 - the namespace name, which is now moved to utils. Users inside sstables/
   now need to add a namespace prefix, while the ones outside have to change
   it to the right one
 - sstables::merge, which has a very non-descriptive name to begin with, is
   changed to a more descriptive name that can live inside utils/
 - the disk_types.hh include has to be removed - but it had no reason to be
   here in the first place.

Todo, is to actually move the file outside sstables/. That is done in a separate
step for clarity.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:23 -04:00
Yoav Kleinberger
624165da79 scyllatop: dump all output to stdout instead of running a fancy console interface
Sometimes the user would like to dump all the metrics into a file or
pipe it to another program, as requested in issue #1506.
This patch makes scyllatop check if stdout is connected to a TTY,
and if not - it does not fire up the fancy urwid UI but instead, just
writes all it's collected metrics to stdout.

Optionally, the user tell the program to quit after a specific
number of iterations via the -n or --iterations flag

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1471777516-9903-1-git-send-email-yoav@scylladb.com>
2016-08-31 08:31:36 +03:00
Paweł Dziepak
e981101fa9 Merge "Remove clustering_key_filtering_context" from Piotr
"clustering_key_filtering_context is no longer needed.
partition_slice can be used instead so this series removes
clustering_key_filtering_context and passes partition_slice down where
it's needed. Then a static get_ranges method is used to obtain
clustering key ranges for a given partition.

Fixes #1614."
2016-08-30 22:30:15 +01:00
Piotr Jastrzebski
3607d99269 Remove clustering_key_filtering_context.
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 20:31:55 +02:00
Piotr Jastrzebski
b05b90b3a5 Introduce clustering_key_filter_ranges.
This fixes the problem of multiple concurrent get_ranges calls.
Previously each call was invalidating the result of the previous
call. Now they don't step on each other foot.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 19:46:38 +02:00
Duarte Nunes
39e0fb1260 storage_proxy: Support multiple partition ranges
This patch adds the ability to query multiple partition ranges. This
is needed since 55f2cf1626, where we
started unwrapping partition ranges in Thrift.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1472474594-15368-1-git-send-email-duarte@scylladb.com>
2016-08-30 17:43:40 +03:00
Takuya ASADA
533dc0485d dist/common/scripts/scylla_sysconfig_setup: sync cpuset parameters with rps_cpus settings when posix_net_conf.sh is enabled and NIC is single queue
On posix_net_conf.sh's single queue NIC mode (which means RPS enabled mode), we are excluded cpu0 and it's sibling from network stack processing cpus, and assigned NIC IRQ to cpu0.
So always network stack is not working on cpu0 and it's sibling, to get better performance we need to exclude these cpus from scylla too.
To do this, we need to get RPS cpu mask from posix_net_conf.sh, pass it to scylla_cpuset_setup to construct /etc/scylla.d/cpuset.conf when scylla_setup executed.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-2-git-send-email-syuu@scylladb.com>
2016-08-30 16:51:16 +03:00
Takuya ASADA
0c3bb2ee63 dist/common/scripts/scylla_prepare: drop unnecesarry multiqueue NIC detection code on scylla_prepare
Right now scylla_prepare specifies -mq option to posix_net_conf.sh when number of RX queues > 1, but on posix_net_conf.sh it sets NIC mode to sq when queues < ncpus / 2.
So the logic is different, and actually posix_net_conf.sh does not need to specify -sq/-mq now, it autodetects queue mode.
So we need to drop detection logic from scylla_prepare, let posix_net_conf.sh to detect it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-1-git-send-email-syuu@scylladb.com>
2016-08-30 16:51:15 +03:00
Pekka Enberg
eff14bae0e transport/server: Explict CQL type IDs
The CQL type IDs are specified as hex in the CQL binary protocol
specification. Define CQL type IDs in the code explicitly to make
reviewing the code and adding new types easier.

Message-Id: <1472537971-26053-1-git-send-email-penberg@scylladb.com>
2016-08-30 09:45:26 +03:00
Avi Kivity
809d739ae8 Merge seastar upstream
* seastar 2b07b1f...0303e0c (3):
  > scripts/posix_net_conf.sh: add support --cpu-mask mode
  > file: improve tmpfs support
  > file::close: remove trailing newline in log message
2016-08-29 13:26:04 +03:00
Pekka Enberg
2d3aee73a6 systemd: Don't start Scylla service until network is up
Alexandr Porunov reports that Scylla fails to start up after reboot as follows:

  Aug 25 19:44:51 scylla1 scylla[637]: Exiting on unhandled exception of type 'std::system_error': Error system:99 (Cannot assign requested address)

The problem is that because there's no dependency to network service,
Scylla simply attempts to start up too soon in the boot sequence and
fails.

Fixes #1618.

Message-Id: <1472212447-21445-1-git-send-email-penberg@scylladb.com>
2016-08-29 13:15:39 +03:00
Takuya ASADA
74d994f6a1 dist/common/scripts/scylla_setup: support enabling services on Ubuntu 15.10/16.04
Right now it ignores Ubuntu, but we shareing .service between Fedora/CentOS and Ubuntu >= 15.10, so support it.

Fixes #1556.

Message-Id: <1471932814-17347-1-git-send-email-syuu@scylladb.com>
2016-08-29 13:13:14 +03:00
Avi Kivity
fb3a83a811 Merge "Slow query logging" from Vlad
"This series introduces a "slow query logging" feature that
allows logging the queries that take more than a specified
threshold time to complete.

Once such a query detected, it will be logged in a system_traces.node_slow_log table.

In addition all trace for that query that have been collected on a Coordinator
are going to be written as well.

If the handling time on a replica in the context of a query takes more than (the same) threshold
they are going to be written too.

The raw in a node_slow_log contains a session_id of a corresponding tracing session,
thereby allowing the user to query the system_traces tables for the corresponding trace
records.

The schema of the node_slow_log table is as follows:

CREATE TABLE system_traces.node_slow_log (
    node_ip inet,
    shard int,
    session_id uuid,
    date timestamp,
    start_time timeuuid,
    command text,
    duration int,
    parameters map<text, text>,
    source_ip inet,
    table_names set<text>,
    username text,
    PRIMARY KEY (start_time, node_ip, shard))
    WITH default_time_to_live = 86400

where
 - node_ip: IP of the coordinator Node.
 - shard: shard ID on a Coordinator where the query was handled.
 - session_id: ID of a corresponding tracing session.
 - date: a time when the query has began.
 - start_time: a time-based UUID for this query (needed for a primary key mostly).
 - command: a query string.
 - duration: a time it took to handle this query (in microseconds).
 - parameters: a map of query parameters (like in system_traces.sessions).
 - source_ip: IP of a Client that sent this query.
 - table_names: a set of "<keyspace>.<table name>" strings representing column
                families used in this query.
 - username: a user name used for this query.

The good thing is that most of the data we needed is already
collected by the regular tracing framework. The only missing ones
are a username and tables' names. So, this series makes the framework collect them too.

The whole feature is integrated in the Tracing framework. The main
changes to the framework that were made are as follows:
 - Store the constant capabilities of the tracing session in an enum_set, e.g.:
  - primary/secondary.
  - write on close.
 - Introduce two new capabilities to a tracing session of a specific query:
  - full tracing: collect all traces for this query (as it is before this series).
  - log slow query: log this query if its duration is above the threshold.
  These two capabilities may be defined independently.
 - Add the logic that handles the "log slow query"-only case:
  - Build the parameters<sstring, sstring> map only if the "duration" is above
    the given threshold.
  - The same about writing the trace entries.
 - In a not-only "log slow query" case:
  - Write the node_slow_log entry.
  - Extend the trace_info struct to pass slow query threshold and TTL to the replica
    Node.

In addition to above this series add the capability to configure the slow query logging
threshold and a TTL for the node_slow_log records.

The heaviest patch in the series is the last one. The series contains a few cosmetic (renaming)
patches that are meant to align the naming of the existing methods with the ones the last one
is going to add."
2016-08-29 13:11:36 +03:00
Gleb Natapov
a2cdddb795 storage_proxy: forward mutation write with correct timeout value
Now that mutation handler knows how much time is left for mutation
write to be handled it can use this knowledge to set correct timeout
for forwarded mutations.

Message-Id: <20160828080637.GE9243@scylladb.com>
2016-08-29 13:06:36 +03:00
Avi Kivity
6cb796f38b Merge seastar upstream
* seastar ef063c5...2b07b1f (1):
  > file: make close() more robust against concurrent calls
2016-08-29 12:25:57 +03:00
Avi Kivity
f5f58b46c7 sstables: enable write-behind
Write-behind allows a single sstable write to saturate the disk,
improving throughput.  Later we can take advantage of this to reduce
the number of sstables being written concurrently.
2016-08-29 12:25:15 +03:00
Pekka Enberg
c5e5e7bb40 dist/docker: Clean up Scylla description for Docker image
Message-Id: <1472145307-3399-1-git-send-email-penberg@scylladb.com>
2016-08-29 10:48:06 +03:00
Vlad Zolotarov
a491ac0f18 tracing: introduce a log_slow_query logic
The main idea is to log queries that take "too long" to complete.
The "too long" is above the given threshold.

To achieve the above this patch does the following:
   - Introduce two new properties to the tracing::trace_state:
      - "Full tracing": when the tracing of this query was explicitly requested.
        In this state we will record all possible traces related to this query:
        both on the coordinator and on any replica involved.
      - "Log slow query": when slow query logging is enabled.
        If slow query logging is enabled and a session's "duration" is above
        the specified threshold we will create a record in the "slow queries log"
        and write all trace records created on the coordinator and on a replica
        if a replica's session lasts longer than that threshold.
        (We will propagate the Coordinator's slow query logging threshold to replicas
        in the context of a specific tracing/logging session).

     The properties above are independent, namely they may be enabled and/or disabled
     independently and any combination of them is legal (naturally, creating a tracing
     session when both states above are disabled makes no sense).
   - Instrument the tracing::tracing service to allow the following:
    - Enable/disable slow query logging.
    - Set/get the slow query duration threshold (in microseconds).
    - Set/get the slow query log record TTL value (in seconds).
   - Instrument the trace_keyspace_helper to write a slow query log entry
     when requested.
   - The slow query logging is disabled by default and the threshold is set to half a second.
   - The TTL of a slow log record is set to 86400 seconds by default.
   - It makes sense to use the same "slow query logging threshold" and a "slow query record TTL"
     both on a coordinator and on a replica Nodes in a context of the same tracing session:
     - Pass both TTL and a threshold to the replica in a trace_info.

This patch also implements the new slow query logging specific logic:
   - Don't write the pending tracing records before the end of a tracing session
     until "duration" reaches the logging threshold.
   - Don't build the parameters<sstring, sstring> map unless we know we will write it
     to I/O.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-28 18:28:44 +03:00
Avi Kivity
e81c1df557 Merge seastar upstream
* seastar 6fadd98...ef063c5 (2):
  > rpc: pass a timeout to a verb's server handler if the one was specified by a client
  > rpc: cleanup the old metaprogramming craft
2016-08-25 17:53:19 +03:00
Paweł Dziepak
6012a7e733 mutation_partition: fix iterator invalidation in trim_rows
Reversed iterators are adaptors for 'normal' iterators. These underlying
iterators point to different objects that the reversed iterators
themselves.

The consequence of this is that removing an element pointed to by a
reversed iterator may invalidate reversed iterator which point to a
completely different object.

This is what happens in trim_rows for reversed queries. Erasing a row
can invalidate end iterator and the loop would fail to stop.

The solution is to introduce
reversal_traits::erase_dispose_and_update_end() funcion which erases and
disposes object pointed to by a given iterator but takes also a
reference to and end iterator and updates it if necessary to make sure
that it stays valid.

Fixes #1609.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>
2016-08-25 16:52:35 +03:00
Paweł Dziepak
5f84348ce1 test.py: add missing nonwrapping_range_test
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472126087-15484-1-git-send-email-pdziepak@scylladb.com>
2016-08-25 15:36:10 +03:00
Piotr Jastrzebski
cda2e8f833 Remove stateless_clustering_key_filter_factory
It can be easily replaced with partition_slice_clustering_key_filter_factory.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-25 08:53:31 +02:00
Piotr Jastrzebski
5bf8807f9b Remove clustering_key_filtering_context::get_filter*
These methods are not used any more so they can go away.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-25 08:53:31 +02:00
Piotr Jastrzebski
7c9de37ef9 Remove clustering_key_filtering_context::want_static_columns
It's always true and clustering_key_filtering_context is
going away so the first step is to get rid of this method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-25 08:53:31 +02:00
Raphael S. Carvalho
d8be32d93a api: use estimation of pending tasks in compaction manager too
We have API for getting pending compaction tasks both in column
family and compaction manager. Column family is already returning
pending tasks properly.
Compaction manager's one is used by 'nodetool compactionstats', and
was returning a value which doesn't reflect pending compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <a20b88938ad39e95f98bfd7f93e4d1666d1c6f95.1471641211.git.raphaelsc@scylladb.com>
2016-08-24 14:00:23 +03:00
Takuya ASADA
b9e02dad2e dist/ami: install scylla metapackage on AMI
Mistakenly we didn't install scylla metapackage on AMI, so install it.

Fixes #1572

Message-Id: <1471977742-21984-1-git-send-email-syuu@scylladb.com>
2016-08-24 12:55:01 +03:00
Vlad Zolotarov
8609900621 tracing: introduce trace_state capabilities bit field
- Instead of keeping separate booleans introduce a trace_state_props_set enum_set and
     pass it around instead of separate booleans.
   - Change the trace_info to hold this value in addition to write_on_close. Initialize
     a corresponding bit in an enum_set based on a write_on_close value in a trace_info
     constructor for a backward compatibility.
   - Separate a trace_state constructor into two:
      - For a primary session object.
      - For a secondary session object.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 18:34:36 +03:00
Amnon Heiman
2b98335da4 housekeeping: Silently ignore check version if Scylla is not available
Normally, the check version should start and stop with the scylla-server
service.

If it fails to find scylla server, there is no need to check the
version, nor to report it, so it can stop silently.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-23 18:08:59 +03:00
Amnon Heiman
4598674673 housekeeping: Use curl instead of Python's libraries
There is a problem with Python SSL's in Ubuntu 14.04:

  ubuntu@ip-10-81-165-156:~$ /usr/lib/scylla/scylla-housekeeping -q version
  Traceback (most recent call last):
    File "/usr/lib/scylla/scylla-housekeeping", line 94, in <module>
      args.func(args)
    File "/usr/lib/scylla/scylla-housekeeping", line 71, in check_version
      latest_version = get_json_from_url(version_url + "?version=" + current_version)["version"]
    File "/usr/lib/scylla/scylla-housekeeping", line 50, in get_json_from_url
      response = urllib2.urlopen(req)
    File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
      return _opener.open(url, data, timeout)
    File "/usr/lib/python2.7/urllib2.py", line 404, in open
      response = self._open(req, data)
    File "/usr/lib/python2.7/urllib2.py", line 422, in _open
      '_open', req)
    File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
      result = func(*args)
    File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open
      return self.do_open(httplib.HTTPSConnection, req)
    File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
      raise URLError(err)
  urllib2.URLError: <urlopen error [Errno 1] _ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure>

Instead of using Python libraries to connect to the check version
server, we will use curl for that.

Fixes #1600

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-23 18:07:05 +03:00
Amnon Heiman
91944b736e housekeeping: Add curl as a dependency
To work around an SSL problem with Python on Ubuntu 14.04, we need to
use curl. Add it as a dependency so that it's available on the host.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-23 18:06:13 +03:00
Vlad Zolotarov
c8cf2ef82c tracing::trace_state: introduce is_in_state() and set_state() accessors
Use these new methods to manipulate trace_state::_state value.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
39b23cd084 tracing::trace_state: rename: get_write_on_close() -> write_on_close()
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
09624f704f tracing::trace_state: rename: get_type() -> type()
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
b40a819d1e tracing::trace_state: rename: get_session_id() -> session_id()
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
ed21398ce9 trace_keyspace_helper: create a system_traces.node_slow_log table
This table is going to be used to store information about queries
which are slower than a specified threshold.

Also added a column caching and mutation creation functions

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
efeb62e72f tracing: trace_keyspace_helper: introduce a check_column_definition() helper function
Checks if a given column definition exists and has a requested type.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
6c3e1935b0 tracing::session_record: change a type of a "ttl" field to be std::chrono::seconds
TTL is always defined in seconds - make its type explicitly reflect that.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
e017533229 tracing: set a username session parameter
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
93c2502be4 tracing: set a table_name in a BATCH query
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
25be28bb3c tracing: set a table_name parameter in a SELECT statement
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
abae3b05e7 tracing: set table_name in a modification statement
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
372da7e71b tracing: add support for setting a username and a table name parameters
- "username" is a name used in the authentication process.
  - "table name" is a <keyspace>.<cf name> string representing a name
    of a table used for a query in question.

Note that there may be more than one table name in a batch query. Therefore
we store an unordered set of tables names.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:42 +03:00
Vlad Zolotarov
eaf5db66a8 tracing::session_record: store "parameters" data in an std::map instead of in an unordered_map
Avoid sorting (and creating a new one) container at a backend code when a sorted
container is needed.

The overhead for the backends where it's not needed is minimal since the size of the
map is very small.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-23 17:58:10 +03:00
Takuya ASADA
80f7449095 dist/ubuntu: support scylla-housekeeping service on all Ubuntu versions
Current scylla-housekeeping support on Ubuntu has bug, it does not installs .service/.timer for Ubuntu 16.04.
So fix it to make it work.

Fixes #1502

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Tested-by: Amos Kong <amos@scylladb.com>
Message-Id: <1471607903-14889-1-git-send-email-syuu@scylladb.com>
2016-08-23 13:49:44 +03:00
Takuya ASADA
aac60082ae dist/common/systemd: don't use .in for scylla-housekeeping.*, since these are not template file
.in is the name for template files witch requires to rewrite on building time, but these systemd unit files does not require rewrite, so don't name .in, reference directly from .spec.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1471607533-3821-1-git-send-email-syuu@scylladb.com>
2016-08-23 13:49:09 +03:00
Duarte Nunes
440c1b2189 thrift: Avoid always recording size estimates
Size estimates for a particular column family are recorded every 5
minutes. However, when a user calls the describe_splits(_ex) verbs,
they may want to see estimates for a recently created and updated
column family; this is legitimate and common in testing. However, a
client may also call describe_splits(_ex) very frequently and
recording the estimates on every call is wasteful and, worse, can
cause clients to give up. This patch fixes this by only recording
estimates if the first attempt to query them produces no results.

Refs #1139

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1471900595-4715-1-git-send-email-duarte@scylladb.com>
2016-08-23 13:08:25 +03:00
Takuya ASADA
383148de13 dist/common/scripts/scylla_bootparam_setup: fix failing setup hugepages= variable on boot parameter
This is caused because mistakenly dropped sourcing sysconfig file, so source it again.
Fixes #1599

Message-Id: <1471943742-19684-1-git-send-email-syuu@scylladb.com>
2016-08-23 12:41:39 +03:00
Takuya ASADA
1ad578ecf1 dist/common/scripts/scylla_bootparam_setup: use distribution standard grub.cfg update command on Ubuntu
Result is almost same, but let's do it in ubuntu/debian flavor.

Message-Id: <1471943898-24490-1-git-send-email-syuu@scylladb.com>
2016-08-23 12:41:34 +03:00
Paweł Dziepak
5feed84e32 sstables: do not call consume_end_partition() after proceed::no
After state_processor().process_state() returns proceed::no the upper
layer should have a chance to act before more data is pushed to the
consumer. This means that in case of proceed::no verify_end_state()
should not be called immediately since it may invoke
consume_end_partition().

Fixes #1605.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1471943032-7290-1-git-send-email-pdziepak@scylladb.com>
2016-08-23 12:24:39 +03:00
Duarte Nunes
ee2694e27d cql3: Consider bound type when detecting wrap around
This patch uses the clustering bounds comparator to correctly detect
wrap around of a clustering range. This fixes a manifestation of #1446,
introduced by b1f9688432, where a query
such as select * from cf where k = 0x00 and c0 = 0x02 and c1 > 0x02
would result in a range containing a clustering key and a prefix,
incorrectly ordered by the prefix equality or lexicographical
comparators.

Refs #1446

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-22 17:55:34 +02:00
Duarte Nunes
084b931457 bounds_view: Create from nonwrapping_range
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-22 17:52:36 +02:00
Duarte Nunes
878927d9d2 range_tombstone: Extract out bounds_view
This patch extracts bounds_view from range_tombstone so its comprator
can be reused elsewhere.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-22 17:52:36 +02:00
Pekka Enberg
9d1d8baf37 dist/docker: Separate supervisord config files
Move scylla-server and scylla-jmx supervisord config files to separate
files and make the main supervisord.conf scan /etc/supervisord.conf.d/
directory. This makes it easier for people to extend the Docker image
and add their own services.

Message-Id: <1471588406-25444-1-git-send-email-penberg@scylladb.com>
2016-08-22 17:20:23 +03:00
Avi Kivity
55f2cf1626 thrift: do not generate wrapping ring_position ranges
As part of the move to unwrap ranges, don't generate wrapping ranges from
thrift.  A little extra motivation is to avoid the need for the solution
to #1573 to be able to handle wrapping ranges.

This patch may also be fixing a bug in that the range (token, token] was
previously translated as (-inf, +inf), while now it is translated as
{(token, +inf), (-inf, token]}; the new translation respects ordering
better.

Reviewed-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1471869587-12972-1-git-send-email-avi@scylladb.com>
2016-08-22 14:43:45 +01:00
Avi Kivity
46caff5b06 Merge "Store frozen_mutation in fragmented buffer" from Paweł
"This series switches frozen_mutations and to use bytes_ostream
internally so that the size of a single allocation is bounded.
Deserializers are also enhanced so that they can cope with reading
from fragmented buffers.

The goal of the change is to reduce memory pressure in case of
large partitions.

Performance as measured by perf_simple_query (median of 30).

            before       after      diff
read     705270.74   702906.35     -0.3%
write    814504.81   836462.33     +2.7%

Refs #1440.
Refs #1545.
Fixes #1546."
2016-08-22 13:01:34 +03:00
Paweł Dziepak
1315090bf0 query-result: no need to linearize buffer any more
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
3fe5ed3cd9 query: use result_view::consume() where appropriate
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
cb2a557cf7 query::result: reduce chunk count
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
ea3ac0a270 frozen_mutation: reduce chunk count in constructor
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
1daf4c73a3 frozen_mutation: avoid buffer linearization and copy
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
3707d7fec3 frozen_mutation: use bytes_ostream internally
Unlike bytes, bytes_ostream supports fragmented buffers, thus reducing
the pressure on the memory allocator caused by large frozen partitions.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
c0425b63ff frozen_mutation: add mutation_view()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
89f7b46f61 idl: switch to utils::input_stream
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
dcf794b04d idl: make bytes compatible with bytes_ostream
This patch makes idl type "bytes" compatible with both bytes and
bytes_ostream.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
2c5ec44281 atomic_cell: add overloads taking const bytes&
Deserialization code is going to use a proxy object that will be casted
to either bytes or bytes_ostream depending on the demand. It cannot be
casted directly to bytes_view though as it won't extend the lifetime of
the buffer appropriately. The simples solution is just to add overloads
that accept const bytes&.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
387434a76c bytes_ostream: add reduce_chunk_count()
Deserialization code has now two variants. The faster one can be used
only when the source buffer is not fragmented. reduce_chunk_count() aims
to increase number of cases when the fast path can be used.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
7d4b7fd5fc bytes_ostream: add equality operator
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
222bde7e6f bytes_ostream: introduce upper bound on chunk size
This patch makes append() and write() limit the maximum size of a single
allocation to bytes_ostream::max_chunk_size.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
0ee98ea4c4 tests: add fragmented input stream test
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
e76203c927 utils: add input_stream
input_stream performs a type erasure on seastar::simple_input_stream and
fragmented_input_stream. The main goal is to keep the overhead for the
cases when simple_input_stream is used minimum.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Paweł Dziepak
29827a9726 utils: add fragmented_input_stream
fragmented_input_stream is an input stream usable by IDL-generated
deserializers which can read from fragmented buffers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-22 09:31:33 +01:00
Asias He
4ffd867ad0 gossip: Add log when cluster or partioner mismatch
It is easier for user to figure out the configuration error.

The log looks like:

   WARN  2016-08-22 15:04:56,214 [shard 0] gossip - ClusterName mismatch
   from 127.0.0.2 test2!=test

   WARN  2016-08-22 15:06:16,106 [shard 0] gossip - Partitioner mismatch from 127.0.0.2
   org.apache.cassandra.dht.RandomPartitioner!=org.apache.cassandra.dht.Murmur3Partitioner

Fixes: #1587

Message-Id: <745ed8857da6f70745735b94eef7b226d2f22e10.1471849834.git.asias@scylladb.com>
2016-08-22 11:06:31 +03:00
Raphael S. Carvalho
77d4cd21d7 sstables: Fix estimation of pending tasks for leveled strategy
There were two underflow bugs.

1) in variable i, causing get_level() to see an invalid level and
throw an exception as a result.
2) when estimating number of pending tasks for a level.

Fixes #1603.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <cce993863d9de4d1f49b3aabe981c475700595fc.1471636164.git.raphaelsc@scylladb.com>
2016-08-22 10:37:15 +03:00
Vlad Zolotarov
92921fe110 tracing::trace_state: push the UUID to the end of an error message in a destructor
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1471780783-25406-1-git-send-email-vladz@cloudius-systems.com>
2016-08-21 16:50:52 +03:00
Vlad Zolotarov
0683d4bd29 tracing::trace_state: don't throw in a destructor
The condition in question is sanity check for a SW bug.

This SW bug (if occurs) is not critical - there is an additional protection
against it in the stop_foreground_and_write().

Having said all that, since we shell not throw from a destructor,
replace throwing of a std::logic_error with an logger error message.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1471773320-7398-1-git-send-email-vladz@cloudius-systems.com>
2016-08-21 13:50:52 +03:00
Avi Kivity
2dcbbf9006 Merge seastar upstream
* seastar 4254111...6fadd98 (6):
  > simple_input_stream: remove explicit copy constructor
  > simple_stream: add [[gnu::always_inline]]
  > perf_fstream: don't initialize variable-size array inline
  > rpc: annotate template function calls with "template"
  > iotune: avoid "auto" parameters
  > file: ext4 support
2016-08-21 13:38:43 +03:00
Paweł Dziepak
e60bb83688 sstables: optimise clustering rows filtering
Clustering rows in the sstables are sorted in the ascending order so we
can use that to minimise number of comparisons when checking if a row is
in the requested range.

Refs #1544.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Reviewed-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <1471608921-30818-1-git-send-email-pdziepak@scylladb.com>
2016-08-19 18:11:11 +03:00
Pekka Enberg
2bf5e8de6e dist/docker: Use Scylla mascot as the logo
Glauber "eagle eyes" Costa pointed out that the Scylla logo used in our
Docker image documentation looks broken because it's missing the Scylla
text.

Fix the problem by using the Scylla mascot instead.
Message-Id: <1471525154-2800-1-git-send-email-penberg@scylladb.com>
2016-08-19 12:50:02 +03:00
Pekka Enberg
4d90e1b4d4 dist/docker: Fix bug tracker URL in the documentation
The bug tracker URL in our Docker image documentation is not clickable
because the URL Markdown extracts automatically is broken.

Fix that and add some more links on how to get help and report issues.
Message-Id: <1471524880-2501-1-git-send-email-penberg@scylladb.com>
2016-08-19 12:49:52 +03:00
Yoav Kleinberger
25fb5e831e docker: extend supervisor capabilities
allow user to use the `supervisorctl' program to start and stop
services. `exec` needed to be added to the scylla and scylla-jmx starter
scripts - otherwise supervisord loses track of the actual process we
want to manage.

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1471442960-110914-1-git-send-email-yoav@scylladb.com>
2016-08-18 15:08:11 +03:00
Avi Kivity
d33d958239 Merge "tracing: cleanups" from Vlad
"This series includes a time stamp representation changes Avi asked.
In addition is fixes a session "duration" semantics to be the time
it took to satisfy the user's request and not a time it took to
achieve the complete replication factor."
2016-08-18 14:36:19 +03:00
Pekka Enberg
1553bec57a dist/docker: Documentation cleanups
- Fix invisible characters to be space so that Markdown to PDF
  conversion works.

- Fix formatting of examples to be consistent.

- Spellcheck.

Message-Id: <1471514924-29361-1-git-send-email-penberg@scylladb.com>
2016-08-18 13:09:37 +03:00
Pekka Enberg
4ca260a526 dist/docker: Document image command line options
This patch documents all the command line options Scylla's Docker image supports.

Message-Id: <1471513755-27518-1-git-send-email-penberg@scylladb.com>
2016-08-18 13:01:58 +03:00
Avi Kivity
42094524e7 Merge seastar upstream
* seastar ab29b12...4254111 (3):
  > file: fix size() return type
  > build: adjust -fsantize=vptr broken warning
  > thread_impl.hh: add missing include
2016-08-18 10:34:58 +03:00
Avi Kivity
d0308ff488 Merge seastar upstream
* seastar 81df893...ab29b12 (1):
  > core: Fix bug in make_file_impl() which affects directory scanning
2016-08-17 21:57:03 +03:00
Amos Kong
9d53305475 systemd: have the first housekeeping check right after start
Issue: https://github.com/scylladb/scylla/issues/1594

Currently systemd run first housekeeping check at the end of
first timer period. We expected it to be run right after start.

This patch makes systemd to be consistent with upstart.

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4cc880d509b0a7b283278122a70856e21e5f1649.1471433388.git.amos@scylladb.com>
2016-08-17 16:02:00 +03:00
Avi Kivity
0033197ba9 Merge seastar upstream
* seastar 823a404...81df893 (3):
  > memory: Do not increase g_allocs on failure in allocate and allocate_aligned
  > memory: Balance the g_frees and g_allocs
  > Merge "thread: explicitly yield on get()" from Glauber

Fixes #1586.
2016-08-17 13:28:30 +03:00
Avi Kivity
4871b19337 Merge "Fixes for streamed_mutation_from_mutation" from Paweł
"This series contains fixes for two memory leaks in
streamed_mutation_from_mutation.

Fixes #1557."
2016-08-17 13:24:22 +03:00
Avi Kivity
e7eb76fc58 Introduce stdx.hh header file
So we don't have to create an stdx = std::experimental alias everywhere.
Message-Id: <1471417039-21391-1-git-send-email-avi@scylladb.com>
2016-08-17 11:19:49 +01:00
Paweł Dziepak
148e9c5608 streamed_mutation_from_mutation: fix destroying bi::sets
Once unlink_leftmost_without_rebalance() has been called on a bi::set no
other method can be used. This includes clear_and_disposed() used by the
mutation_partition destructor.

We like unlink_leftmost_without_rebalance() because it is efficient, so
the solution is to manually finish destroying clustering row and range
tombstone sets in the reader destructor using that function.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-17 11:03:59 +01:00
Paweł Dziepak
fe9575d01d streamed_mutation_from_mutation: fix leak on allocation failure
mutation_fragment() constructor allocates memory. If it fails the
already unlinked parts of mutation (either rows_entry or range_tombtone)
will be leaked.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-08-17 11:02:24 +01:00
Benoit Canet
90ef150ee9 systemd: Remove WorkingDirectory directive
The WorkingDirectory directive does not support environment variables on
systemd version that is shipped with Ubuntu 16.04. Fortunately, not
setting WorkingDirectory implicitly sets it to user home directory,
which is the same thing (i.e. /var/lib/scylla).

Fixes #1319

Signed-of-by: Benoit Canet <benoit@scylladb.com>
Message-Id: <1470053876-1019-1-git-send-email-benoit@scylladb.com>
2016-08-17 12:34:11 +03:00
Raphael S. Carvalho
108fd1fade database: close file in lister
After listing is done, let's close file. This fixes no bug.
It's only an improvement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2f52d297bcf6a6b6e3429912c28f17e6b37f8842.1471381607.git.raphaelsc@scylladb.com>
2016-08-17 11:01:44 +03:00
Glauber Costa
b361dee488 database: memtables pending flushes tell us nothing
We have two counters that tracks how many memtable flushes are in progress, and
how much memory are they pinning.

The problem is, after we have revamped the code to limit the amount of flushes
in progress, those counters became useless: as they live inside the semaphore
side, they will only be incremented once we have past the semaphore.

One wouldn't notice if working with CPU-bound problems, where memtables don't
pile. But as soon as they do, those counters will always show the same numbers:
the depth of the semaphore, which doesn't mean much. The problem is poised to
become much worse: once we enable write behind in full and set the semaphore's
depth to one, that's the number we'll see here all the time.

The fix is to move the counters outside the semaphore, which will bring back its
old semantics.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>
2016-08-17 10:54:15 +03:00
Piotr Jastrzebski
bb0c4c3c40 Fix compilation errors
query::range parameter in mutation_partiton::range
has to be changed to nonwrapping_range.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <36e444bfe90586f8d3b08ca36d8dc13d5898ef97.1471347402.git.piotr@scylladb.com>
2016-08-16 12:49:54 +01:00
Vlad Zolotarov
37da6f53f8 tracing: fix a session "duration" semantics
A session's "duration" should be a time it took to
handle a request, which is a time till response to a user.
In other words - till a consistency level is reached.

Before this patch is was a time that takes a complete
handling of a request, which is the time it takes to handle
all replicas and not only those required to reach a CL.

This patch fixes this situation by extending the trace_state's state
values to 3 states: inactive, foreground and background.

A primary session may be in 3 states:
  - "inactive": between the creation and a begin() call.
  - "foreground": after a begin() call and before a
    stop_foreground_and_write() call.
  - "background": after a stop_foreground_and_write() call and till the
    state object is destroyed.

- Traces are not allowed while state is in an "inactive" state.
- The time the primary session was in a "foreground" state is the time
  reported as a session's "duration".
- Traces that have arrived during the "background" state will be recorded
  as usual but their "elapsed" time will be greater or equal to the
  session's "duration".

Secondary sessions may only be in an "inactive" or in a "foreground"
states.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-16 12:32:34 +03:00
Vlad Zolotarov
f83e33fc13 tracing: make "elapsed" be std::chrono::duration
- Define an tracing::elapsed_clock type (std::chrono::steady_clock).
     Use it instead of trace_state::clock_type.
   - Store the "elapsed" information in a form of elapsed_clock::duration.
   - Make all keyspace_backend specific conversions inside the trace_keyspace_helper
     class, where they belong.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-16 12:32:34 +03:00
Vlad Zolotarov
ebf13da9c9 tracing::session_record: make start_at to be a time_point
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-16 12:32:34 +03:00
Vlad Zolotarov
5e24f6f442 trace_keyspace_helper::apply_events_mutation(): Avoid extra std::move of std::deque
events_records are promised to be kept alive till the future returned
by apply_events_mutation() resolves: it's dowithificated by a caller already.

In addition, since its passed by a reference, it's a logical thing to demand
it to be kept alive by a caller till the future above resolves.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-16 12:32:34 +03:00
Avi Kivity
bf02ca831d Merge "Tracing: change a back pressure scheme" from Vlad
"This series changes the tracing back pressure scheme from limiting the amount traces in
a single session by a fixed number to have a per-shard budget consumed by all active tracing
sessions.

It was really easy to cause the traces to be dropped even if there weren't too many
active traces: e.g. if there was a single active session which creates more traces
than a per-session limit (30) the traces above 30-th were going to be dropped. Namely
traces were dropped when there were only 30 active traces, which is ridiculous.

This series introduces two main changes:
   - Changes the records budgeting from being per-session to be per-shard. This substantially
     increases the amount of active records after which new records are going to be dropped.
   - Introduces a flow when events' records are written BEFORE the corresponding tracing
     session is over (right now traces are written to I/O back end only when the session object
     is destroyed).

The later is meant to virtually eliminate the traces drops in normal situations at all.
Of course, if a back end is slow or if there are a lot of small sessions that do not complete we would still have
to drop new sessions/records in order to avoid uncontrolled growth of a memory foot print of Tracing.

If we see the later case happening a lot in the future we may add lowres timers to each session that would
commit the cached records for writing every X time. But let's not try to optimize something that we
are not completely sure has to be optimized... "
2016-08-16 12:21:02 +03:00
Amnon Heiman
0706db9387 API: use the estimated sum when converting histogram to json
The function that convert histogram to the json histogram object need to
use the estimated_sum to get the actual sum and not the sampled sum.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1467547341-30438-3-git-send-email-amnon@scylladb.com>
2016-08-16 11:06:51 +03:00
Amnon Heiman
4c14b2a527 histogram: Add an estimated sum method
The histogram implementation uses sampling to estimate the mean and sum.
This patch adds a method that returns an estimated sum based on the mean
and the total number of events measured.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1467547341-30438-2-git-send-email-amnon@scylladb.com>
2016-08-16 11:06:50 +03:00
Pekka Enberg
fde8677f1a cql3/query_processor: Clean up code formatting
Currently, query_processor.cc code formatting is all over the place,
which makes the file hard to read. Apply some formatting magic to make
it prettier.

Message-Id: <1470832486-26020-2-git-send-email-penberg@scylladb.com>
2016-08-16 10:39:15 +03:00
Pekka Enberg
ce07822f49 cql3/query_processor: Use type deduction to make code more readable
Use the 'auto' specifier for variables and lambda parameters to make the
code more readable.

Message-Id: <1470832486-26020-1-git-send-email-penberg@scylladb.com>
2016-08-16 10:39:11 +03:00
Avi Kivity
b1f9688432 Merge "range: Add nonwrapping_range" from Duarte
"Ranges that wrap around are a source of complexity and bugs. This patchset
adds a nonwrapping_range class, which specifies the range can't wrap around.
It is the user of the nonwrapping_range that is required to enforce this
constraint.

The idea is to incrementaly disallow ranges that wrap around. We do it
for query::clustering_range in this patchset, and it can be done similarly
for other ranges. This moves the burden of unwrapping ranges to the edges.

Fixes #1544"
2016-08-16 10:08:24 +03:00
Duarte Nunes
5161ea283f query: query::clustering_range can't wrap around
This patch changes the type of query::clustering_range to express that
ranges that wrap around are not allowed, and ranges that have the
start bound after the end bound are considered empty.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:50:20 +00:00
Duarte Nunes
3275fabe53 storage_proxy: Short circuit query without clustering ranges
This patch makes the storage_proxy return an empty result when the
query doesn't define any clustering ranges (default or specific).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:48:57 +00:00
Duarte Nunes
56f10abce3 thrift: Don't always validate clustering range
This patch makes make_clustering_range not enforce that the range be
non-wrapping, so that it can be validated differently if needed. A
make_clustering_range_and_validate function is introduced that keeps
the old behavior.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:48:57 +00:00
Duarte Nunes
be4adf212a nonwrapping_range: Add unit tests
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:48:57 +00:00
Duarte Nunes
bb16e194bc range: Add nonwrapping_range class
This patch introduces the nonwrapping_range class. This class is
intended to be used by code that requires non wrapping ranges.
Internally, it uses a wrapping_range. Users are responsible for
ensuring the bounds are correct when creating a nonwrapping_range.

The path proposed here is to incrementally replace usages of
wrapping_range/range by nonwrapping_range, pushing usages of wrapping
ranges as further to the edges as possible.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:48:57 +00:00
Duarte Nunes
2bb428973a range: Rename to wrapping_range
This patch renames range to wrapping_range in preparation for adding a
new range type, nonwrapping_range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:48:57 +00:00
Duarte Nunes
2c0b049176 clustering_key_filter: Don't forward declare range
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:48:57 +00:00
Paweł Dziepak
5cae44114f partition_version: handle errors during version merge
Currently, partition snapshot destructor can throw which is a big no-no.
The solution is to ignore the exception and leave versions unmerged and
hope that subsequent reads will succeed at merging.

However, another problem is that the merge doesn't use allocating
sections which means that memory won't be reclaimed to satisfy its
needs. If the cache is full this may result in partition versions not
being merged for a very long time.

This patch introduces partition_snapshot::merge_partition_versions()
which contains all the version merging logic that was previously present
in the snapshot destructor. This function may throw so that it can be
used with allocating sections.

The actual merging and handling of potential erros is done from
partition_snapshot_reader destructor. It tries to merge versions under
the allocating section. Only if that fails it gives up and leaves them
unmerged.

Fixes #1578
Fixes #1579.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1471265544-23579-1-git-send-email-pdziepak@scylladb.com>
2016-08-15 15:56:53 +03:00
Asias He
ef782f0335 gossip: Add heart_beat_version to collectd
$ tools/scyllatop/scyllatop.py '*gossip*'

node-1/gossip-0/gauge-heart_beat_version 1.0
node-2/gossip-0/gauge-heart_beat_version 1.0
node-3/gossip-0/gauge-heart_beat_version 1.0

Gossip heart beat version changes every second. If everyting is working
correctly, the gauge-heart_beat_version output should be 1.0. If not,
the gauge-heart_beat_version output should be less than 1.0.

Message-Id: <cbdaa1397cdbcd0dc6a67987f8af8038fd9b2d08.1470712861.git.asias@scylladb.com>
2016-08-15 12:32:00 +03:00
Nadav Har'El
0d00da7f7f sstables: don't forget to read static row
[v2: fix check for static column (don't check if the schema is not compound)
 and move want-static-columns flag inside the filtering context to avoid
 changing all the callers.]

When a CQL request asks to read only a range of clustering keys inside
a partition, we actually need to read not just these clustering rows, but
also the static columns and add them to the response (as explained by Tomek
in issue #1568).

With the current code, that CQL request is translated into an
sstable::read_row() with a clustering-key filter. But this currently
only reads the requested clustering keys - NOT the static columns.

We don't want sstable::read_row() to unconditionally read the from disk
the static columns because if, for example, they are already cached, we
might not want to read them from disk. We don't have such partial-partition
cache yet, but we are likely to have one in the future.

This patch adds in the clustering key filter object a flag of whether we
need to read the static columns (actually, it's function, returning this
flag per partition, to match the API for the clustering-key filtering).

When sstable::read_row() sees the flag for this partition is true, it also
request to read the static columns.
Currently, the code always passes "true" for this flag - because we don't
have the logic to cache partially-read partitions.

The current find_disk_ranges() code does not yet support returning a non-
contiguous byte range, so this patch, if it notices that this partition
really has static columns in addition to the range it needs to read,
falls back to reading the entire partition. This is a correct solution
(and fixes #1568) but not the most efficient solution. Because static
columns are relatively rare, let's start with this solution (correct
by less efficient when there are static columns) and providing the non-
contiguous reading support is left as a FIXME.

Fixes #1568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1471124536-19471-1-git-send-email-nyh@scylladb.com>
2016-08-15 12:30:19 +03:00
Avi Kivity
4fcebd4ca6 random_partitioner: fix overflow in shard_of()
uint128_t will overflow if smp::count > 2.  Replace with a larger type.

Message-Id: <1471188765-30142-1-git-send-email-avi@scylladb.com>
2016-08-15 09:41:54 +03:00
Amnon Heiman
612f677283 scylla.spec: conditionally include the housekeeping.cfg in the conf package
When the housekeeping configuration name was changed from conf to cfg it
was no longer included as part of the conf rpm.

This change adds a macro that determines of if the file should be
included or not and use that marco to conditionally add the
configuration file to the rpm.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1471169042-19099-1-git-send-email-amnon@scylladb.com>
2016-08-14 13:25:59 +03:00
Tomasz Grabiec
1b2ea14d0e partition_version: Add missing linearization context
Snapshot removal merges partitions, and cell merging must be done
inside linearization context.

Fixes #1574

Reviewed-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1471010625-18019-1-git-send-email-tgrabiec@scylladb.com>
2016-08-12 17:55:23 +03:00
Piotr Jastrzebski
f212a6cfcb Fix after free access bug in storage proxy
Due to speculative reads we can't guarantee that all
fibers started by storage_proxy::query will be finished
by the time the method returns a result.

We need to make sure that no parameter passed to this
method ever changes.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <31952e323e599905814b7f378aafdf779f7072b8.1471005642.git.piotr@scylladb.com>
2016-08-12 16:34:43 +02:00
Duarte Nunes
918a2939ff docker: If set, broadcast address is seed
This patch configures the broadcast address to be the seed if it is
configured, otherwise Scylla complains about it and aborts.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470863058-1011-1-git-send-email-duarte@scylladb.com>
2016-08-12 11:46:50 +03:00
Avi Kivity
21392cf5fd Merge seastar upstream
* seastar 7fd8d49...823a404 (1):
  > io_priority_class: remove non-explicit operator unsigned
2016-08-11 17:20:23 +03:00
Avi Kivity
65aa9135a1 Merge seastar upstream
* seastar 59613e7...7fd8d49 (1):
  > reactor: Do not test for poll mode default
2016-08-11 14:46:45 +03:00
Amnon Heiman
5a4fc9c503 scylla-housekeeping: rename configuration file from conf to cfg
Files with a conf extension are run by the scylla_prepare on the AMI.
The scylla-housekeeping configuration file is not a bash script and
should not be run.

This patch changes its extension to cfg which is more python like.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470896759-22651-2-git-send-email-amnon@scylladb.com>
2016-08-11 14:44:56 +03:00
Tomasz Grabiec
f1c2481040 sstables: Fix bug in promoted index generation
maybe_flush_pi_block, which is called for each cell, assumes that
block_first_colname will be empty when the first cell is encountered
for each partition.

This didn't hold after writing partition which generated no index
entry, because block_first_colname was cleared only when there way any
data written into the promoted index. Fix by always clearing the name.

The effect was that the promoted index entry for the next partition
would be flushed sooner than necessary (still counting since the start
of the previous partition) and with offset pointing to the start of
the current partition. This will cause parsing error when such sstable
is read through promoted index entry because the offset is assumed to
point to a cell not to partition start.

Fixes #1567

Message-Id: <1470909915-4400-1-git-send-email-tgrabiec@scylladb.com>
2016-08-11 13:08:48 +03:00
Amnon Heiman
a24941cc5f build_deb: Add dist flag
The dist flag mark the debian package as distributed package.
As such the housekeeping configuration file will be included in the
package and will not need to be created by the scylla_setup.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470907208-502-2-git-send-email-amnon@scylladb.com>
2016-08-11 12:25:07 +03:00
Pekka Enberg
d1a052237d dist/docker: Fix typo in "--overprovisioned" help text
Reported by Mathias Bogaert (@analytically).
Message-Id: <1470904395-4614-1-git-send-email-penberg@scylladb.com>
2016-08-11 11:38:03 +03:00
Nadav Har'El
7409688356 README.md: add another required package
I tried to compile scylladb on a new Fedora 24 system, and the "-lsystemd"
library was missing during like. We need the systemd-devel package for that.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470865063-12871-1-git-send-email-nyh@scylladb.com>
2016-08-11 10:21:21 +02:00
Avi Kivity
42d8701121 Merge seastar upstream
* seastar 64ae228...59613e7 (20):
  > reactor: fix I/O queue pending requests collectd metric
  > simple_input_stream: introduce copy_to() member function
  > scollectd: Fix merge skew between "disabled" and "descriped" metrics patches
  > Merge "Write-behind for XFS"
  > Merge "Fix the SMP queue poller" from Tomasz
  > Merge "collectd syntactical sugar & descriptions" from Calle
  > Fix build failure introduced by 5b1051ce0de5b772f22444abe6a1d97076b49b1f
  > Add --abort-on-seastar-bad-alloc option
  > Merge "Make arp comply with C++ strict aliasing rules"
  > doc: use install-dependencies.sh on docs
  > add execute bit on install-dependencies.sh
  > core/reactor: Fix use-after-free on io_event's promise
  > tcp: write option length correctly
  > tcp: make tcp options comply with strict aliasing rules
  > tcp: comply with strict aliasing rules
  > add libdl to library list
  > reactor: add exception counter
  > install-dependencies.sh: remove unnecessary sudo
  > install-dependencies.sh: install add-apt-repository when it's not installed
  > install-dependencies.sh: add protobuf to dependencies, for newly added prometheus API support

Fixes #1558.
2016-08-10 15:14:31 +03:00
Tomasz Grabiec
d7f8ce7722 Merge branch 'raphael/fix_min_max_metadata_v2' from git@github.com:raphaelsc/scylla.git
Fix for generation of sstables min/max clustering metadata from Raphael.
2016-08-10 10:43:35 +02:00
Pekka Enberg
6a5ab6bff4 dist/docker: Add '--smp', '--memory', and '--overprovisoned' options
Add '--smp', '--memory', and '--overprovisioned' options to the Docker
image. The options are written to /etc/scylla.d/docker.conf file, which
is picked up by the Scylla startup scripts.

You can now, for example, restrict your Docker container to 1 CPU and 1
GB of memory with:

   $ docker run --name some-scylla penberg/scylla --smp 1 --memory 1G --overprovisioned 1

Needed by folks who want to run Scylla on Docker in production.

Cc: Sasha Levin <alexander.levin@verizon.com>
Message-Id: <1470680445-25731-1-git-send-email-penberg@scylladb.com>
2016-08-10 11:34:08 +03:00
Raphael S. Carvalho
8deb1ca19d tests: add test to check sstables's min and max clustering values
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-08-09 15:54:40 -03:00
Raphael S. Carvalho
ef6ddf2398 sstables: fix tracking of min and max clustering components
Scylla was tracking min and max column names instead. Min and max
clustering components are tracked to optimize reads that use a
clustering filter. For more details:
https://issues.apache.org/jira/browse/CASSANDRA-5514

Also fix potential bug if clustering value is empty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-08-09 15:01:30 -03:00
Vlad Zolotarov
5deec0e327 tracing::write_complete(): improve a message in case of a logic error
Improve a message if there is a logic error and add logging of such
errors.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-09 19:00:43 +03:00
Vlad Zolotarov
67d537ecb5 tracing: issue a write event if a single session creates a lot of events
Currently write events are issued every time a trace session is closed.
However if a single session creates a lot of events we will start dropping them
after the total amount of pending records bypasses the limit.

This patch will issue a write event before the session end in that case.

Since now new events may be added to the active tracing session while it's
scheduled for write we have to ensure the following:
   - Not to add the already pending for write session to the pending bulk.
   - Grab all pending data in a specific session in a synchronous way during
     the write event.
   - Serialize creation of events mutations - otherwise the "monotonic nanos"
     logic won't work.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-09 19:00:43 +03:00
Vlad Zolotarov
5391bcc5a9 tracing: improve a back pressure policy
Use a per-shard tracing records budget instead
of maintaining a fixed-size per-session records budget and
a per-shard sessions budget.

The original policy could lead to some irrational situations,
when we have a single tracing session that creates a substantial
amount of records that we can handle but we would start dropping
new records after it surpasses the per-session limit.

The new policy handles a per-shard trace records budget that is
being consumed by each trace() call and by a primary session destructor
when a session record is created.

Each active record may only be in one of the following states:
   - cached: stored in its session's object. When record is in this state
             it's not going to be written to I/O during the next write event.
   - pending for write: when record is in this state it's going to be written
             to I/O during the next write event.
   - flushing: the record is being currently written to the I/O.

There are counters of the total amount of records in each state above.

Each record may only be in a specific state at every point of time and
thereby it must be accounted only in one and only one of the three
counters.

The sum of all three counters should not be greater than
(max_pending_trace_records + write_event_records_threshold) at any time
(actually it can get as high as a value above plus (max_pending_sessions)
if all sessions are primary but we won't take this into an account for
simplicity).

The same is about the number of outstanding sessions: it may not be greater
than (max_pending_sessions + write_event_sessions_threshold) at any time.

If total number of tracing records is greater or equal to the limit
above, the new trace point is going to be dropped.

If current number or records plus the expected number of trace records
per session (exp_trace_events_per_session) is greater than the limit
above new sessions will be dropped. A new session will also be dropped if
there are too many active sessions.

When the record or a session is dropped the appropriate statistics
counters are updated and there is a rate-limited warning message printed
to the log.

Every time a number of records pending for write is greater or equal to
(write_event_records_threshold) or a number of sessions pending for
write is greater or equal to (write_event_sessions_threshold) a write
event is issued.

Every 2 seconds a timer would write all pending for write records
available so far.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-09 19:00:43 +03:00
Vlad Zolotarov
d8fe5317d1 tracing::trace_keyspace_helper: make events' mutations applying loop interruptible
When building events' mutation don't apply them in a tight loop
but rather apply each of them in a separate continuation to allow
reactor to interrupt this loop if it takes too long for it to
complete (e.g. where there are a lot of mutations to apply).

Since building all events' mutations is asynchronous now we can
no longer keep the "nanos" state in a global trace_keyspace_helper
object but rather have to move it into the per-session
backend_session_state class.

backend_session_state class is a backend-specific implementation of a
tracing::backend_session_state_base class.

An instance of the above object is created by a
tracing::i_tracing_backend_helper::allocate_session_state() virtual
method and is stored in a tracing::one_session_records object.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-09 19:00:39 +03:00
Vlad Zolotarov
63a0502ed1 tracing: rework the interface between the tracing/trace_state and the backend
Before this patch the interaction between the layers above was as follows:
   - trace_state was passing the trace event data to a backend object every
     time trace() method was called.
   - trace_state was passing the session data to a backend object in a destructor.
   - A backend object was storing this data in a form of lambda where all data
     above was caught in a capture list. This was primarily done in order to
     delay the call for make_xxx_mutation(). Lambdas were stored in a map by a session
     ID and they were executed when a kick() method was called.
   - A tracing::tracing object was periodically calling a kick() method of a
     backend that was initiating a write of all pending data to the storage.

All backend methods used in the described above interactions were virtual.
Thereby, for instance, for each and every trace record we were calling a virtual method that was
receiving a significant amount of parameters, store a lambda in a map and return.
This is clearly a suboptimal way of using virtual functions since we prevent a compiler
from inlining an obviously inlinable operations.

This patch changes the interaction scheme to be as follows:
   - Trace events and session data are stored and passed around in a form of structs
     that hold all relevant information (no more lambdas).
   - As long as a trace session is active its data is aggregated inside the corresponding
     trace_state object.
   - The object containing all records is passed and stored as a lw_shared_ptr to save extra
     copies and to shorten capture lists.
   - All aggregated data is passed to a tracing::tracing object in a trace_state destructor.
     The data is stored in a std::deque in a tracing::tracing object (instead of a map by a session ID).
   - A single backend's virtual method call writes all data aggregated so far (kick()
     method is not needed any more), every time a write event occurs.
   - Backend has only one virtual method now:
      - Write a bulk of sessions' data aggregated so far.
   - Backend's virtual method receives a records bulk object by reference.

As a result:
   - A latency of a single trace event that has no formatting improved from 0.2us to 0.1us.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-09 15:25:52 +03:00
Vlad Zolotarov
960b423ce0 tracing/tracing.cc: rename a logger object
s/logger/tracing_logger/

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-09 15:21:47 +03:00
Nadav Har'El
c2e4f5ba16 Avoid some warnings in debug build
The sanitizer of the debug build warns when a "bool" variable is read when
containing a value not 0 or 1. In particular, if a class has an
uninitialized bool field, which class logic allows to only be set later,
then "move"ing such an object will read the uninitialized value and produce
this warning.

This patch fixes four of these warnings seen in sstable_test by initializing
some bool fields to false, even though the code doesn't strictly need this
initialization.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com>
2016-08-09 13:21:45 +01:00
Vlad Zolotarov
e1b2926a8d tracing: add a missing try-catch in params building
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-09 15:21:41 +03:00
Duarte Nunes
0ed19ec64d thrift: Set default validator
This patch sets the default validator for dynamic column families.
Doing so has no consequences in terms of behavior, but it causes the
correct type to be shown when describing the column family through
cassandra-cli.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470739773-30497-1-git-send-email-duarte@scylladb.com>
2016-08-09 13:55:07 +02:00
Avi Kivity
da4d33802e Merge "Add configuration file to scylla-housekeeping" from Amnon
"The series adds an optional configuration file to the scylla-housekeeping. The
file act as a way to prevent the scylla-housekeeping to run. A missing
configuration file, will make the scylla-housekeeping immediately.

The series adds a flag to the build_rpm that differentiate between public
distributions that would contain the configuration file and private
distributions that will not contain it which will cause the setup script to
create it."
2016-08-09 14:52:19 +03:00
Nadav Har'El
e005762271 sstable: avoid copying non-existant value
The promoted-index reading code contained a bug where it copied the value
of an disengaged optional (this non-value was never used, but it was still
copied ). Fix it by keeping the optional<> as such longer.

This bug caused tests/sstable_test in the debug build to crash (the release
build somehow worked).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470742418-8813-1-git-send-email-nyh@scylladb.com>
2016-08-09 14:35:18 +03:00
Duarte Nunes
f63886b32e thrift: Send empty col metadata when describing ks
This patch ensures we always send the column metadata, even when the
column family is dynamic and the metadata is empty, as some clients
like cassandra-cli always assume its presence.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470740971-31169-1-git-send-email-duarte@scylladb.com>
2016-08-09 14:33:18 +03:00
Pekka Enberg
9ff242d339 cql3: Filter compaction strategy class from compaction options
Cassandra 2.x does not store the compaction strategy class in compaction
options so neither should we to avoid confusing the drivers.

Fixes #1538.
Message-Id: <1470722615-29106-1-git-send-email-penberg@scylladb.com>
2016-08-09 10:38:37 +02:00
Pekka Enberg
c23acbe5e6 Update scylla-ami submodule
* dist/ami/files/scylla-ami 2e599a3...14c1666 (1):
  > setup coredump on first startup
2016-08-09 11:09:24 +03:00
Nadav Har'El
bce020efbd Fix failing tests
Commit 0d8463aba5 broke some of the tests with an assertion
failure about local_is_initialized(). It turns out that there is more than
one level of local_is_initialized() we need to check... For some tests,
neither locals were initialized, but for others, one was and the other
wasn't, and the wrong one was tested.

With this patch, all unit tests except "flush_queue_test.cc" pass on my
machine. I doubt this test is relevant to the promoted index patches,
but I'll continue to investigate it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com>
2016-08-09 10:51:40 +03:00
Calle Wilund
1593f4254b flush_queue_test: Start semaphore in propagation tests not initialized
Somehow, all my local runs timed ok anyway, but obviously not on all
machines.
Message-Id: <1470727968-1759-1-git-send-email-calle@scylladb.com>
2016-08-09 09:35:28 +02:00
Asias He
d8bff4f745 gossip: Fix debug log in wait_for_gossip_to_settle
There is an extra '{}' in the logger format string.

Fixes:

gossip - Gossip looks settled. 8 gossip round completed: ???

Message-Id: <1470278008-29914-2-git-send-email-asias@scylladb.com>
2016-08-08 16:38:21 +03:00
Takuya ASADA
60ce16cd54 dist/common/scripts: mkdir -p /var/lib/scylla/coredump before symlinking
We are creating this dir in scylla_raid_setup, but user may create XFS volume w/o using the command, scylla_coredump_setup should work on such condition.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1470638615-17262-1-git-send-email-syuu@scylladb.com>
2016-08-08 16:02:55 +03:00
Pekka Enberg
3b31d500c8 dist/docker: Document data volume and cpuset configuration
Message-Id: <1470649675-5648-1-git-send-email-penberg@scylladb.com>
2016-08-08 15:50:30 +03:00
Pekka Enberg
4372da426c dist/docker: Add '--broadcast-rpc-address' command line option
We already have a '--broadcat-address' command line option so let's add
the same thing for RPC broadcast address configuration.

Message-Id: <1470656449-11038-1-git-send-email-penberg@scylladb.com>
2016-08-08 15:46:42 +03:00
Avi Kivity
98226a14ac Merge "Exception propagation writers in commitlog batch"
"
While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.

Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.

Flush queue can now (optionally) propagate exceptions to all clients, and
commit log uses this to ensure that commit log writers in batch mode
all generate exceptions on disk errors.

Also includes some rudimentary tests for flush queue mechanisms.

Note: other main user, sstable flushing, is not affected, as default
mode is still to keep exceptions to individual worker continuations,
not waiters."
2016-08-08 15:33:26 +03:00
Avi Kivity
700feda0db Merge "promoted index for reading partial partitions" from Nadav
"The goal of this patch series is to support reading and writing of a
"promoted index" - the Cassandra 2.* SSTable feature which allows reading
only a part of the partition without needing to read an entire partition
when it is very long. To make a long story short, a "promoted index" is
a sample of each partition's column names, written to the SSTable Index
file with that partition's entry. See a longer explanation of the index
file format, and the promoted index, here:

     https://github.com/scylladb/scylla/wiki/SSTables-Index-File

There are two main features in this series - first enabling reading of
parts of partitions (using the promoted index stored in an sstable),
and then enable writing promoted indexes to new sstables. These two
features are broken up into smaller stand-alone pieces to facilitate the
review.

Three features are still missing from this series and are planned to be
developed later:

1. When we fail to parse a partition's promoted index, we silently fall back
   to reading the entire partition. We should log (with rate limiting) and
   count these errors, to help in debugging sstable problems.

2. The current code only uses the promoted index when looking for a single
   contiguous clustering-key range. If the ck range is non-contiguous, we
   fall back to reading the entire partition. We should use the promoted
   index in that case too.

3. The current code only uses the promoted index when reading a single
   partition, via sstable::read_row(). When scanning through all or a
   range of partitions (read_rows() or read_range_rows()), we do not yet
   use the promoted index; We read contiguously from data file (we do not
   even read from the index file, so unsurprisingly we can't use it)."
2016-08-07 17:53:17 +03:00
Avi Kivity
bbaebb39a8 Update scylla-ami submodule
* dist/ami/files/scylla-ami 863cc45...2e599a3 (1):
  > Do not set developer-mode on unsupported instance types
2016-08-07 17:51:24 +03:00
Nadav Har'El
fc063ae62d tests: add test for promoted index writing
In this unit test, we create using Scylla C++ code, the same large
partition with 13520 CQL rows as we previously imported from Cassandra
for the large partition test. We then verify that the sstable index file
we just wrote is byte-for-byte identical to the one previously created by
Cassandra. They should indeed be identical, because the data file has the
same layout (even if timestamps are different) and our default promoted-
index block size is the same (64K) so the sample of columns should be
identical.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:15 +03:00
Nadav Har'El
0d8463aba5 sstables: promoted index write support
This patch adds writing of promoted index to sstables.

The promoted index is basically a sample of columns and their positions
for large partitions: The promoted index appears in the sstable's index
file for partitions which are larger than 64 KB, and divides the partition
to 64 KB blocks (as in Cassandra, this interval is configurable through
the column_index_size_in_kb config parameter). Beyond modifying the index
file, having a promoted index may also modify the data file: Since each
of blocks may be read independently, we need to add in the beginning of
each block the list of range tombstones that are still open at that
position.

See also https://github.com/scylladb/scylla/wiki/SSTables-Index-File

Fixes #959

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:12 +03:00
Nadav Har'El
09ede5f333 range tombstone accumulator: add method
Add to the range_tombstone_accumulator a range_tombstones_for_row(ck)
method.

Just like the existing tombstone_for_row(ck), this function drops from
the accumulator tombstones that end before ck. But while the existing
function returned just a single tombstone affecting the given row (the
most recent tombstone), the new function range_tombstones_for_row(ck)
returns all the accumulated range tombstones which cover ck.

This function will be useful for the promoted-index writing code later,
which divides a partition into blocks which may be read independently,
so each block needs to start with a repeat of the earlier tombstones
which still cover the first row in the new block.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:10 +03:00
Nadav Har'El
022a69caea sstables: promoted index read support
This patch adds support more efficiently reading small parts of a large
partition, without reading the entire partition as we had to do so far.
This is done using the "promoted index".

The "promoted index" is stored in the sstable index file, and provides
for each large sstable row ("partition" in CQL nomenclature) a sample of
the column names at (for example) 64KB intervals. This means that when we
read a slice of columns (e.g., cql rows), or page through a large partition,
we do not have to read the entire partition from disk.

This patch only implements the read side of promoted index - a later patch
will add the write-side support (i.e., writing the promoted index to the
index file while saving the sstable). Nevertheless this patch can already
be tested by reading existing sstables from Cassandra which include a
promoted index - such as the one included in the test in the previous patch.

The use of the promoted index currently has two limitations:

1. It is only used when reading a single partition with sstable::read_row(),
not when scanning through many partitions with sstable::read_range_rows()
or sstable::read_rows().

2. It is only used when filtering a single clustering-key range, rather
than a list of disjoint ranges. A single range is the common case.

These two issues will be improved later. In the meantime, in those
unsupported cases we simply continue to read entire partitions, so we're not
worse-off than before.

Also note that this patch only helps when sstable::read_row() is used with
a clustering-key prefix (i.e., a slice). Our higher-level request handling
code may decide to read an entire partition into the cache, and not use
a clustering-key prefix at all when reading. We will need to indepdently
improve the high-level code to use read_row()'s slicing capabilities
when paging through large partitions, for example.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:09 +03:00
Nadav Har'El
6cdf5684f5 sstables: introduce find_disk_ranges()
Our sstable reading code is currently hard-coded to read entire partitions,
even if we know that only a subset of the columns are requested.
This patch introduces find_disk_ranges(), a function to find the ranges of
bytes we need to read from the sstable data file to guarantee that the
desired columns from the desired partition are read.

The returned range may be the entire byte range of the given partition -
as found using the summary and index files - but if the index contains a
"promoted index" (basically a sample of column positions for each key)
we may return a smaller range. The "disk_read_range" type introduced in
the previous patch is extended here to support reading a partial partition -
by including additional information which would be missed when reading only
part of a partition (viz., the partition key and the partition's tombstone).

This function isn't used in this patch - we will wire its use in the next
patch, which will complete the read-side support for the promoted index.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:08 +03:00
Nadav Har'El
1d38a69e49 sstables: expose promoted index in index entry
Our index_entry type, holding one partition's entry that we read from the
index file, already contained the "_promoted_index" which we read from
disk - as an unparsed byte buffer. But there wasn't any API to access
this buffer after it was read. This patch adds a trivial getter, to get
a read-only view of this buffer.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:06 +03:00
Nadav Har'El
4e5a09538d make column-name parsing code public
The "struct column" code in partition.cc is generally useful code for
parsing serialized column names from the sstable. It is currently private
inside the "mp_row_consumer" class. But in a next patch we'll also want
to use it in the "sstable" class, for the promoted-index parsing code,
which among other things also needs to deserialize column names.

The trivial fix, in this patch, is to make this code "public". However,
for now it is still available only in partition.cc.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:04 +03:00
Nadav Har'El
1975abbfd6 sstables: disk_read_range
Currently, the main sstable data parsing entry point data_consume_rows()
takes a contiguous range of bytes to read from disk and parse. This range
is supposed to be an entire partition or contiguous group of partitions.
and is self contained (can be parsed without extra information about the
identity of these partitions).

For the promoted index feature (which we will add in a following patch)
we will want the range to span only a part of a partition, and will need
the caller to provide some information not available to the parser (such
as the partition's key). In the future, we will also want to support a
vector of byte ranges, instead of just one.

So in preparation for this, this patch simply replaces the start/end pair
by a new class disk_read_range, which can be easily extended in later
patches. No new functionality is introduced in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:02 +03:00
Nadav Har'El
6fed716bd3 sstables: STOP_THEN_ATOM_START parser state
In a later patch adding "promoted index" read support, we would like to
parse only part of an sstable row. In that case, the parser should start
not at the usual ROW_START state, but rather at the ATOM_START state.

But there's a problem: The sstable parser consumer currently assumes that
the parser stops after the start of the row, before reading any atoms.
So in the partial row case too, we must stop parsing before reading the
first atom.

For this, this patch adds the new "STOP_THEN_ATOM_START" parser state.
When starting in this state, the parser stops immediately (with
row_consumer::proceed::no), and when restarted again it will be in the
ATOM_START case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:01 +03:00
Nadav Har'El
3dd079fb7a tests: add test for reading parts of a large partition
This patch adds a test that takes an sstable with one partition of 13,520
clustering rows (spanning 700 KB in the data file), and attempts to read
various slices CQL rows, counting that we got back the expected number
of rows.

The sstable included here was generated by Cassandra, and includes a
promoted index. Promoted index reading is not supported yet (we will
add it in the next patch), so for now the code will always read the
entire partition from disk; But still the clustering-key filtering is
already functional, and will drop some of the rows as requested,
so this test will pass.

Later, when we add promoted index support, we should check that this test
still passes - promoted index will make the reads in this test more
efficient (which the test cannot verify), but the important thing to check
is that it doesn't break any of these tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:46:59 +03:00
Takuya ASADA
3d45d6579b dist/ami/files: add a message for privacy policy agreement on login prompt
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1470212054-351-1-git-send-email-syuu@scylladb.com>
2016-08-07 17:40:25 +03:00
Amnon Heiman
beba3a31cb scylla-housekeeping.service: Add a configuration file
Adding the configuration file is a way to make the running
scylla-housekeeping service not run the check version.

If the file does not exists, it will be set by the setup script.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-07 16:24:43 +03:00
Amnon Heiman
9c4bf651ae install housekeeping.conf based on dist flag
The new dist flag difrentiate between public distribution and private
compilations.

For public distributation the housekeeping configuration file will be
installed.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-07 16:24:42 +03:00
Amnon Heiman
060c3029b0 scylla_setup: create the scylla-housekeeping conf file if missing
When running the scylla_setup, the script would check that the
housekeeping configuration file: housekepping.conf exists.
If not, it would ask the user if to run the check version option or not
and create the file accordingly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-07 16:24:23 +03:00
Amnon Heiman
63582dd043 Add a config file for housekeeping
The housekeeping.conf is a configuration file for scylla-housekeeping.
By default it will be included in the rpm and state that the
check-version would be run.

If the file is missing, or if check-version is set to false, the check
version operation will not be performed.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-07 13:14:59 +03:00
Amnon Heiman
93737a601f scylla-housekeeping: An optional config file
This patch adds an optional config file that can be passed from the
command line.

If the config file is specified and does not exist, the script will
terminate.

The only parameter that is currently available is check-version that can
be either true or false.

The ConfigParser module is used to read the config file, that should be
on the form of:
[housekeeping]
check-version: True

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-08-07 13:14:59 +03:00
Duarte Nunes
e0a43a82c6 system_keyspace: Correctly deal with wrapped ranges
This patch ensures we correctly deal with ranges that wrap around when
querying the size_estimates system table.

Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470412433-7767-1-git-send-email-duarte@scylladb.com>
2016-08-05 19:17:00 +03:00
Avi Kivity
b0a275945f Merge "Remove compact columns" from Duarte
"The compact column is a dense schema's single regular column. Its
existence has been a source of bugs, so this patchset removes the
column_kind::compact_column, as well as further references to compact
columns from the code base.

Fixes #1542"
2016-08-05 12:39:23 +03:00
Takuya ASADA
bd1ab3a0ad dist/ami/files: show warning message for unsupported instance types
Notify users to run scylla_io_setup before lunching scylla on unsupported instance types.

Fixes #1511

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1470090415-8632-1-git-send-email-syuu@scylladb.com>
2016-08-05 09:51:13 +03:00
Avi Kivity
28ee2bdbd2 Merge "Docker image fixes" from Pekka
"Kubernetes is unhappy with our Docker image because we start systemd
under the hood. Fix that by switching to use "supervisord" to manage the
two processes -- "scylla" and "scylla-jmx":

  http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/

While at it, fix up "docker logs" and "docker exec cqlsh" to work
out-of-the-box, and update our documentation to match what we have.

Further work is needed to ensure Scylla production configuration works
as expected and is documented accordingly."
2016-08-04 15:11:18 +03:00
Pekka Enberg
394c8f8c4f dist/docker: Document Scylla cluster setup
Add instructions on how to make a cluster of two Scylla nodes.
2016-08-04 12:20:46 +03:00
Glauber Costa
fe6a0d97d1 logalloc: make sure allocations in release_requests don't recurse back into the allocator
Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().

This patch wraps the relevant calls in the reclaimer lock. This do mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.

To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function

Thanks Tomek for the suggestion.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
2016-08-04 11:16:53 +02:00
Pekka Enberg
7deddbe17a dist/docker: Fix Docker Hub documentation
Fix Docker Hub documentation to match what we have right now.

More work is needed in the following areas:

* How to make a cluster
* How to configure Docker image for production use
2016-08-04 10:05:08 +03:00
Pekka Enberg
6c8c60a5fc dist/docker: Setup hostname in cqlshrc
We configure the hostname in the "CQLSH_HOST" environment variable but
that is only picked up if we first start the shell. Setup the hostname
in $HOME/.cqlshrc file instead so that we can start "cqlsh" directly:

  docker exec -it scylla cqlsh
2016-08-04 09:57:08 +03:00
Pekka Enberg
d0aeb53e7c dist/docker: Log to stdout instead of syslog
We don't have systemd running on the image so "journalctl" is useless.
Log to stdout instead which has the nice benefit of making "docker logs"
produce meaningful output on the host.
2016-08-04 09:46:26 +03:00
Glauber Costa
ad58691afb logalloc: make sure blocked requests memory allocations are served from the standar allocator
Issue 1510 describes a scenario in which, under load, we allocate memory within
release_requests() leading to a reentry into an invalid state in our
blocked requests' shared_promise.

This is not easy to trigger since not all allocations will actually get to the
point in which they need a new segment, let alone have that happening during
another allocator call.

Having those kinds of reentry is something we have always sought to avoid with
release_requests(): this is the reason why most of the actual routine is
deferred after a call to later().

However, that is a trick we cannot use for updating the state of the blocked
requests' shared_promise: we can't guarantee when is that going to run, and we
always need a valid shared_promise, in a valid state, waiting for new requests
to hook into.

The solution employed by this patch is to make sure that no allocation
operations whatsoever happen during the initial part of release_requests on
behalf of the shared promise.  Allocation is now deferred to first use, which
relieves release_requests() from all allocation duties. All it needs to do is
free the old object and signal to the its user that an allocation is needed (by
storing {} into the shared_promise).

Fixes #1510

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>
2016-08-03 20:40:30 +02:00
Duarte Nunes
cb0516a76c schema: Remove compact_column concept
This is a confusing one, and can be replaced the fact that dense
schemas have a single regular column.

Ref #1542

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-03 17:21:41 +00:00
Duarte Nunes
529c3a3ae6 column_kind: Drop compact_column
A compact column is a dense schema's single regular column. The fact
that it is a different column_kind has lead to various bugs (#1535,
derived by the schema being dense and the column being regular.

Fixes #1542

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-03 17:21:37 +00:00
Calle Wilund
0f9e868839 commitlog: Use exception propagation in flush_queue (for batch)
Fixes: #1490

While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.

Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.

Thus, adding a muation to commit log in batch, will now, if an error
is generated, result in an exception to the caller, which should be
interpreted as "data might not have been persisted".

The failing segment is then closed, and we happily hope things will
get better in the next. Which they probably wont.

Missing: registration of some sort of "error-handling policy", similar
to origin, which can either kill transports or shut down process.
(A reasonable guess is that disk errors in commit log are not gonna
be recoverable).
2016-08-03 14:49:43 +00:00
Calle Wilund
9098eed30b flush_queue_test: Add tests for exception propagation
v2:
* Remove leading "_" in template types
2016-08-03 14:49:43 +00:00
Calle Wilund
620e54cae4 flush_queue: Allow exception propagation to waiters
Re-worked to use shared_promise<> as signal mechanism
(because we have that now), which also makes it less painful
to implement exceptions propagating not only from "func" to
"post", but also from given func->post chain entry to any
waiters.

v2:
* Remove leading "_" in template types
2016-08-03 14:49:38 +00:00
Tomasz Grabiec
9476bc5a31 Introduce --abort-on-lsa-bad-alloc command line option
Useful for triggerring core dump on allocation failure inside LSA,
which makes it easier to debug allocation failures. They normally
don't cause aborts, just fail the current operation, which makes it
hard to figure out what was the cause of allocation failure.

Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>
2016-08-03 17:26:44 +03:00
Avi Kivity
9df4ac53e5 conf: synchronize internode_compression between scylla.yaml and code
Our default is "none", to give reasonable performance, so have scylla.yaml
reflect that.
2016-08-03 16:50:48 +03:00
Amnon Heiman
b18b067b26 Add prometheus API
This patch adds the prometheus API it adds the proto library to the
compilation, adds an optional configuration parameter to change the
prometheus listening port and start the prometheus API in main.

To disable the prometheus API, set its listening port to 0.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470231628-22831-2-git-send-email-amnon@scylladb.com>
2016-08-03 16:49:42 +03:00
Amnon Heiman
bb4268a8a5 Add prometheus API
This patch adds the prometheus API it adds the proto library to the
compilation, adds an optional configuration parameter to change the
prometheus listening port and start the prometheus API in main.

To disable the prometheus API, set its listening port to 0.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470228764-19545-2-git-send-email-amnon@scylladb.com>
2016-08-03 15:55:18 +03:00
Duarte Nunes
1516cd4c08 schema: Dense schemas are correctly upgrades
When upgrading a dense schema, we would drop the cells of the regular
(compact) column. This patch fixes this by making the regular and
compact column kinds compatible.

Fixes #1536

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470172097-7719-1-git-send-email-duarte@scylladb.com>
2016-08-03 13:39:01 +02:00
Paweł Dziepak
02ffc28f0d sstables: extend sstable life until reader is fully closed
data_consume_rows_context needs to have close() called and the returned
future waited for before it can be destroyed. data_consume_context::impl
does that in the background upon its destruction.

However, it is possible that the sstable is removed before
data_consume_rows_context::close() completes in which case EBADF may
happen. The solution is to make data_consume_context::impl keep a
reference to the sstable and extend its life time until closing of
data_consume_rows_context (which is performed in the background)
completes.

Side effect of this change is also that data_consume_context no longer
requires its user to make sure that the sstable exists as long as it is
in use since it owns its own reference to it.

Fixes #1537.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1470222225-19948-1-git-send-email-pdziepak@scylladb.com>
2016-08-03 13:19:08 +02:00
Pekka Enberg
59bd5e485b dist/docker: Use supervisord to manage multiple processes
Switch to supervisord to manage the two processes we have: Scylla server
and Scylla JMX proxy. We need this to make the Docker image run under
Kubernetes, which now fails as follows as we try to start the systemd
init process:

  Couldn't find an alternative telinit implementation to spawn.

I have not seen other people hitting the issue, except for GitLab Docker
image:

  https://gitlab.com/gitlab-org/gitlab-ce/issues/18612

which "solved" the problem by not running init...

  https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/838/diffs

Furthermore, the "supervisord" approach seems to be what people actually
use in Docker land:

  http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/

The only downside is that we now sort of duplicate functionality that's
already in the systemd configuration files. However, we should work
towards Scylla figuring out its configuration rather than compose a long
list of command line arguments. Once we do that, the duplication in
Docker supervisord scripts disappears.
2016-08-03 11:59:04 +03:00
Paweł Dziepak
5f11a727c9 Merge "partition_limit: Don't count dead partitions" from Duarte
"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 works correctly."
2016-08-03 09:49:30 +01:00
Duarte Nunes
db1118e4f7 database_test: Add case for partition limit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-02 22:11:15 +00:00
Duarte Nunes
84e3969014 mutation_query_test: Add test for partition limit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-02 22:08:39 +00:00
Duarte Nunes
b0c5996580 read_command: Add comment explaining partition_limit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-02 21:17:06 +00:00
Duarte Nunes
54ad038aa6 storage_proxy: Enforce partition_limit
This patch enforces the partition_limit at the
mutation_result_merger.

Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-02 21:17:06 +00:00
Duarte Nunes
ec490ffaba query_result_builder: Don't count dead partitions
With this patch we stop counting dead partitions (i.e., partitions
containing only tombstones) towards the partition limit, which
should apply only to partitions with live rows.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-02 21:17:06 +00:00
Duarte Nunes
167e400ca8 compact_mutation: Don't count dead partitions
With this patch we stop counting dead partitions (i.e., partitions
containing only tombstones) towards the partition limit, which should
apply only to partitions with live rows.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-02 21:17:06 +00:00
Paweł Dziepak
0f902738f0 Revert "storage_proxy: Enforce partition_limit"
This reverts commit 141ea49e05.

There was a confusion around the meaning of "partition limit".
Parts of our code interpreted it just as "maximum number of partitions".
This is also how Cassandra behaves.

However, the other parts of the code, including data query, interpreted
it as "maximum number of live partitions" or otherwise skipped dead
partitions resulting in #1447.

A decision has been made to stick to the "maximum number of live
partitions" interpretation everywhere. The consequences are, among
others, that the patch reverted by this one is no longer correct.

While, the actual series fixing the interpretations of partition limit
and getting rid of the confusion is yet to come, the purpose of this
revert is to make backporting easier (as the patch being reverted
hasn't made it to branch-1.3 yet).
2016-08-02 16:53:01 +01:00
Duarte Nunes
5995aebf39 schema_builder: Ensure dense tables have compact col
This patch ensures that when the schema is dense, regardless of
compact_storage being set, the single regular columns is translated
into a compact column.

This fixes an issue where Thrift dynamic column families are
translated to a dense schema with a regular column, instead of a
compact one.

Since a compact column is also a regular column (e.g., for purposes of
querying), no further changes are required.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470062410-1414-1-git-send-email-duarte@scylladb.com>
2016-08-02 14:49:13 +02:00
Avi Kivity
fbc3377ad4 row_cache: add a counter for a miss that did not result in an insertion
Such misses are due to concurrent access to the same key.  Add a counter
to track this as it results in unnecessary I/O being performed.

See #1534.
Message-Id: <1470139871-14693-1-git-send-email-avi@scylladb.com>
2016-08-02 14:14:27 +02:00
Tomasz Grabiec
d2ed75c9ff Merge 'pdziepak/row-cache-wide-entries/v4' from seastar-dev.git
This series adds the ability for partition cache to keep information
whether partition size makes it uncacheable. During, reads these
entries save us IO operations since we already know that the partiiton
is too big to be put in the cache.

First part of the patchset makes all mutation_readers allow the
streamed_mutations they produce to outlive them, which is a guarantee
used later by the code handling reading large partitions.
2016-08-02 12:26:56 +02:00
Duarte Nunes
141ea49e05 storage_proxy: Enforce partition_limit
This patch enforces the partition_limit at the
mutation_result_merger.

Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470065526-3174-1-git-send-email-duarte@scylladb.com>
2016-08-02 10:10:43 +01:00
Avi Kivity
9f35e4d328 checked_file: preserve DMA alignment
Inherit the alignment parameters from the underlying file instead of
defaulting to 4096.  This gives better read performance on disks with 512-byte
sectors.

Fixes #1532.
Message-Id: <1470122188-25548-1-git-send-email-avi@scylladb.com>
2016-08-02 10:03:02 +01:00
Paweł Dziepak
d7123d21eb Merge "cql3: remove schema_altering_statement::prepare" from Avi
"schema_altering_statement is both a cql_statement and a
prepared_statement.
This makes it hard to understand because virtual functions from both
base classes are present, and hard to separate the raw and prepared
variants.

This patchset removes schema_altering_statement::prepare() (and the
enable_shared_from_this<> that makes it work) in preparation for
splitting its subclasses into raw and prepared variants (note that
create_table_statement was already split)."
2016-08-02 09:46:55 +01:00
Tomasz Grabiec
0c1bf6c861 scylla-gdb.py: Fix lookup of global symbols
Fixes errors like the one below:

  (gdb) scylla memory
  Python Exception <class 'gdb.error'> A syntax error in expression, near `memory::cpu_mem'.:
  Error occurred in Python command: A syntax error in expression, near `memory::cpu_mem'.

Wrapping the symbol in quotes instructs GDB to lookup in the global
context instead of the context of current frame.
Message-Id: <1470050751-3167-1-git-send-email-tgrabiec@scylladb.com>
2016-08-01 13:51:15 +01:00
Calle Wilund
e4a845145a flush_queue_test: Do "close" at end of tests to ensure gate is balanced 2016-08-01 08:23:55 +00:00
Calle Wilund
a277975fd4 flush_queue: ensure gate is always closed (even with exceptions) 2016-08-01 08:23:42 +00:00
Takuya ASADA
9b59bb59f2 dist/ami: Install scylla metapackage and debuginfo on AMI
Install scylla metapackage and debuginfo on AMI to make AMI to report bugs easier.
Fixes #1496

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1469635071-16821-1-git-send-email-syuu@scylladb.com>
2016-07-31 18:37:41 +03:00
Takuya ASADA
89b790358e dist/common/scripts: disable coredump compression by default, add an argument to enable compression on scylla_coredump_setup
On large memory machine compression takes too long, so disable it by default.
Also provide a way to enable it again.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1469706934-6280-1-git-send-email-syuu@scylladb.com>
2016-07-31 18:36:51 +03:00
Takuya ASADA
d3746298ae dist/ami: setup correct repository when --localrpm specified
There was no way to setup correct repo when AMI is building by --localrpm option, since AMI does not have access to 'version' file, and we don't passed repo URL to the AMI.
So detect optimal repo path when starting build AMI, passes repo URL to the AMI, setup it correctly.

Note: this changes behavor of build_ami.sh/scylla_install_pkg's --repo option.
It was repository URL, but now become .repo/.list file URL.
This is optimal for the distribution which requires 3rdparty packages to install scylla, like CentOS7.
Existing shell scripts which invoking build_ami.sh are need to change in new way, such as our Jenkins jobs.

Fixes #1414

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1469636377-17828-1-git-send-email-syuu@scylladb.com>
2016-07-31 18:36:21 +03:00
Avi Kivity
ec62f0d321 Merge "housekeeping: Switch to pytho2 and handle version" from Amnon
This series handle two issues:
* Moving to python2, though python3 is supported, there are modules that we
  need that are not rpm installable, python3 would wait when it will be more
  mature.

* Check version should send the current version when it check for a new one and
  a simple string compare is wrong.
2016-07-31 14:55:36 +03:00
Amnon Heiman
3170b477d0 ubuntu control.in: set python2 request dependency
scylla-housekeeping moved to python2, this change the dependency to take
the python2 requests module.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-07-31 10:47:22 +03:00
Amnon Heiman
c8bcb5a8bf scylla.spec: Set the python dependencies for housekeeping
The scylla-housekeeping moved to python2, this set the python
dependencies under redhat.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-07-31 10:40:40 +03:00
Avi Kivity
a6ec4aa547 cql3: schema_altering_statement: remove prepare()
In preparation for splitting raw and prepared variants of subclasses
of schema_altering_statement, remove schema_altering_statment::prepare().
All subclasses already implement it themselves (by creating a new instance).
2016-07-30 23:29:32 +03:00
Avi Kivity
0687dc3689 cql3: add fake create_table_statement::prepare()
In preparation for the removal of schema_altering_statement::prepare(),
add a fake create_table_statement::prepare().  create_table_statement
has already been split to raw and prepared variants, so this prepare()
will never be called, but it is required because schema_altering_statement
is both a cql_statement and a prepared_statement.  This confusion will
be fixed later on.
2016-07-30 23:26:28 +03:00
Avi Kivity
713cfc3182 cql3: copy drop_type_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:14:49 +03:00
Avi Kivity
6631cd3277 cql3: copy drop_table_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:14:35 +03:00
Avi Kivity
382235fca4 cql3: copy drop_keyspace_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:14:18 +03:00
Avi Kivity
6a099f2690 cql3: copy create_type_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:14:03 +03:00
Avi Kivity
2d961334d8 cql3: copy create_keyspace_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:13:45 +03:00
Avi Kivity
35ef82e78d cql3: copy create_index_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:12:55 +03:00
Avi Kivity
23eef0a610 cql3: copy alter_type_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:12:24 +03:00
Avi Kivity
8b50f75958 cql3: copy alter_table_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 23:10:43 +03:00
Avi Kivity
81b75dfa47 cql3: copy alter_keyspace_statement when preparing it
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
2016-07-30 22:45:08 +03:00
Avi Kivity
b881945d45 estimated_histograms: fix indentation, bracing 2016-07-30 20:13:16 +03:00
Avi Kivity
75ee8fc2a7 size_estimates_recorder: adjust indentation 2016-07-30 20:10:12 +03:00
Piotr Jastrzebski
ca9c29e296 Cache information about partition being wide
Once we encounter a wide partition store information
about this in cache entry and don't try to read it all
and cache next time it's requested.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[Paweł: rebased, moved large partition reading logic to
cache_entry::read_wide()]
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 18:39:22 +01:00
Paweł Dziepak
42f566433e tests/row_cache_test: use BOOST_REQUIRE_EQUAL() istead of raw assert()
In case of failure BOOST_REQUIRE_EQUAL() is nicer and prints the actual
values that were supposed to be equal, but aren't.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 17:19:18 +01:00
Paweł Dziepak
ee1f1ee1c4 row_cache: fix creating readers for large partitions
There were cases of use-after-free introduced by the code responsible
for creating mutation_readers for large partitions – the lifetimes
of partition ranges and the readers themselves weren't sufficiently
extended.

Another problem, was that if the partition was no longer present in the
sstable the reader would return EOS which was then returned by
range_populating_reader itself causing its users to incorrectly
interpret that as an end of stream.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 17:02:17 +01:00
Paweł Dziepak
7b479d8b41 clarify relations between mutation_reader and streamed_mutation
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 15:58:42 +01:00
Paweł Dziepak
3a27582cd4 sstables: allow streamed_mutation to outlive mutation_reader
This patch makes sstable_streamed_mutation keep a reference to
sstable_data_source object which contains full state necessary to read
the sstable. That state is also shared with parent mutation_reader (only
for range queries), but now its lifetime is appropriately extended if
the mutation_reader is destoryed before streamed_mutation.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 15:53:09 +01:00
Paweł Dziepak
5ba4cd1a0b sstables: enable_lw_shared_from_this for sstable
sstable has member functions that create objects which need to extend
lifetime of the sstable (for example mutation_readers), the easiest way
to achieve that is to enable_lw_shared_from_this for sstable.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 15:51:12 +01:00
Duarte Nunes
7d1b7e8da3 storage_service: Fix get_range_to_address_map_in_local_dc
This patch fixes a couple of bugs in
get_range_to_address_map_in_local_dc.

Fixes #1517

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469782666-21320-1-git-send-email-duarte@scylladb.com>
2016-07-29 11:11:07 +02:00
Gleb Natapov
3531dd8d71 api: fix use after free in sum_sstable
get_sstables_including_compacted_undeleted() may return temporary shared
ptr which will be destroyed before the loop if not stored locally.

Fixes #1514

Message-Id: <20160728100504.GD2502@scylladb.com>
2016-07-28 14:25:40 +03:00
Piotr Jastrzebski
fdfd1af694 Use continuity flag correctly with concurrent invalidations
Between reading cache entry and actually using it
invalidations can happen so we have to check if no flag was
cleared if it was we need to read the entry again.

Fixes #1464.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <7856b0ded45e42774ccd6f402b5ee42175bd73cf.1469701026.git.piotr@scylladb.com>
2016-07-28 11:55:18 +01:00
Duarte Nunes
25a44ee6cf sstables: Validate static cell is on static column
This patch enforces compatibility between a cell and the
corresponding column definition with regards to them being
static.

[tgrabiec: Fixed typo in "definition"]

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469699532-26984-1-git-send-email-duarte@scylladb.com>
2016-07-28 12:01:31 +02:00
Duarte Nunes
6fc6adbdeb sstable_mutation_test: Test non-compound cell name
This patch adds a test case for reading non-compound cell names,
validating that such a cell is not incorrectly marked as static.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-5-git-send-email-duarte@scylladb.com>
2016-07-28 11:12:20 +02:00
Duarte Nunes
40f388f2b5 sstables: Don't assume cell name is compound
The current code assumes cell names are always compound and may
wrongly report a non-static row as such, since it looks at the first
bytes of the name assuming they are the component's length.

Tables with compact storage (which cannot contain static rows) may not
have a compound comparator, so we check for the table's compoundness
before checking for the static marker. We do this by delegating to
composite_view::is_static.

Fixes #1495

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-4-git-send-email-duarte@scylladb.com>
2016-07-28 11:12:06 +02:00
Duarte Nunes
05c3d4f22b composite: Use operator[] instead of at()
Since we already do bounds checking on is_static(), we can use
bytes_view::operator[] instead of bytes_view::at() to avoid repeating
the bounds checking.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-3-git-send-email-duarte@scylladb.com>
2016-07-28 11:11:54 +02:00
Duarte Nunes
6c9076fdd7 composite_view: Fix is_static
composite_view's is_static function is wrong because:

1) It doesn't guard against the composite being a compound;
2) Doesn't deal with widening due to integral promotions and
   consequent sign extension.

This patch fixes this by ensuring there's only one correct
implementation of is_static, to avoid code duplication and
enforce test coverage.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-2-git-send-email-duarte@scylladb.com>
2016-07-28 11:11:38 +02:00
Amnon Heiman
406fa11cc5 scylla-housekeeping: check version should use the current version
This patch handle two issues with check version:
* When checking for a version, the script send the current version
* Instead of string compare it uses parse_version to compare the
versions.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-07-28 11:31:05 +03:00
Amnon Heiman
641e5dc57c scylla-housekeeping: Switchng to pythno2
There is a problem with python module installation in pythno3,
especially on centos. Though pytho34 has a normal package, alot of the
modules are missing yum installation and can only be installed by pip.

This patch switch the  scylla-housekeeping implementation to use
python2, we should switch back to python3 when CeontOS python3 will be
more mature.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-07-28 11:27:52 +03:00
Tomasz Grabiec
27f6c0d62b tests: lsa_async_eviction_test: Use chunked_fifo<>
To protect against large reallocations during push() which are done
under reclaim lock and may fail.
2016-07-28 09:33:12 +02:00
Tomasz Grabiec
17b45fae9e tests/memory_footprint: Fix runtime errors
The test bit rot. One thing is that cache is no longer empty right
after creation, since we added ghost entries to it. Second problem is
that mutation serializer needs storage service to be initialized, so
we need to setup a full cql test env.
Message-Id: <1469626546-4279-1-git-send-email-tgrabiec@scylladb.com>
2016-07-27 14:38:36 +01:00
Duarte Nunes
f468453cbe README.md: Replace yum with dnf
yum is démodé.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616395-5537-1-git-send-email-duarte@scylladb.com>
2016-07-27 13:50:31 +03:00
Duarte Nunes
e4464f7500 README.md: Add protobuf dependencies
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616338-5199-1-git-send-email-duarte@scylladb.com>
2016-07-27 13:50:19 +03:00
Duarte Nunes
5aaf43d1bc thrift: Preserve partition order when accumulating
This patch changes the column_visitor so that it preservers the order
of the partitions it visits when building the accumulation result.

This is required by verbs such as get_range_slice, on top of which
users can implement paging. In such cases, the last key returned by
the query will be that start of the range for the next query. If
that key is not actually the last in the partitioner's order, then
the new request will likely result in duplicate values being sent.

Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469568135-19644-1-git-send-email-duarte@scylladb.com>
2016-07-27 10:34:29 +02:00
Avi Kivity
64d0cf58ea size_estimates_recorder: unwrap ranges before searching for sstables
column_family::select_sstables() requires unwrapped ranges, so unwrap
them.  Fixes crash with Leveled Compaction Strategy.

Fixes #1507.

Reviewed-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469563488-14869-1-git-send-email-avi@scylladb.com>
2016-07-27 10:06:21 +03:00
Takuya ASADA
b542581d97 dist: add protobuf to dependencies, for newly added prometheus API support
Since we added prometheus API support, we need protobuf compiler and header to build seastar/scylla, so add dependencies on packaging.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1469573936-8551-1-git-send-email-syuu@scylladb.com>
2016-07-27 08:48:58 +03:00
Paweł Dziepak
efa690ce8c stables: fix skipping partitions with no rows
If partition contains no static and clustering rows or range tombstones
mp_row_consumer will return disengaged mutation_fragment_opt with
is_mutation_end flag set to mark end of this partition.

Current, mutation_reader::impl code incorrectly recognized disengaged
mutation fragment as end of the stream of all mutations. This patch
fixes that by using is_mutation_end flag to determine whether end of
partition or end of stream was reached.

Fixes #1503.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1469525449-15525-1-git-send-email-pdziepak@scylladb.com>
2016-07-26 13:02:03 +03:00
Amos Kong
64530e9686 scylla-housekeeping: fix typo of script path
I tried to start scylla-housekeeping service by:
 # sudo systemctl restart scylla-housekeeping.service

But it's failed for wrong script path, error detail:
 systemd[5605]: Failed at step EXEC spawning
 /usr/lib/scylla/scylla-Housekeeping: No such file or directory

The right script name is 'scylla-housekeeping'

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <c11319a3c7d3f22f613f5f6708699be0aa6bd740.1469506477.git.amos@scylladb.com>
2016-07-26 09:18:43 +03:00
Raphael S. Carvalho
b7cdfafbdd tests: fix compilation of partitioner test
For some unknown reason, there were some duplicated definitions
of bytes3. They were not needed at all.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <7f2b87211c592e573f45f277dc0ab4a8c037f258.1469490327.git.raphaelsc@scylladb.com>
2016-07-26 09:04:08 +03:00
Avi Kivity
1d1f61ac65 bytes_ostream: fix assignment operators
Protect against self-assignment, and provide the strong exception guarantees
for copying assignment.

Fixes #1499.
Message-Id: <1469466156-11114-1-git-send-email-avi@scylladb.com>
2016-07-25 19:06:42 +02:00
Avi Kivity
9e515c17a3 Merge "Optimize deserialization" from Tomasz
Fixes #1330.
2016-07-25 19:52:28 +03:00
Tomasz Grabiec
cea4957a11 serializer: Avoid copying when deserializing bytes_ostream 2016-07-25 17:35:42 +02:00
Tomasz Grabiec
22664b9b7f serializer: Optimize serializer<>::skip() for remaining types 2016-07-25 17:35:42 +02:00
Tomasz Grabiec
2e121d4b18 serializer: Implement serializer<std::vector<T>>::skip() using skip_array<T>() 2016-07-25 17:35:42 +02:00
Tomasz Grabiec
af94ed538e serializer: Optimize serializer<std::array<T, N>>::skip() 2016-07-25 17:35:42 +02:00
Tomasz Grabiec
fd5ccce919 serializer: Move skip() to serializer.hh
It is part of the core API, like serialize() and deserialize().
2016-07-25 17:35:42 +02:00
Tomasz Grabiec
d3658b33da tests: Add test for skip() not doing full deserialization 2016-07-25 17:35:42 +02:00
Tomasz Grabiec
c3707ff754 idl: Avoid full deserialization in skip()
Fixes #1330.
2016-07-25 17:35:34 +02:00
Tomasz Grabiec
033312f686 serializer: Add serializer<enum_set<T>> 2016-07-25 17:22:28 +02:00
Tomasz Grabiec
1ddba66861 serializer: Add serializer<std::unique_ptr<T>> 2016-07-25 17:22:28 +02:00
Tomasz Grabiec
4ec29d88d3 serializer: Add serializer<sstring> 2016-07-25 17:22:28 +02:00
Tomasz Grabiec
517a501ace serializer: Add serializer<std::experimental::optional<T>> 2016-07-25 17:22:28 +02:00
Tomasz Grabiec
c67e047b92 serializer: Add serializer<bytes_ostream> 2016-07-25 17:22:28 +02:00
Tomasz Grabiec
1bc63a133b serializer: Add serializer<bytes> 2016-07-25 17:22:28 +02:00
Tomasz Grabiec
51e25cb50e serializer: Add serializer<std::chrono::time_point<Clock, Duration>> 2016-07-25 17:22:25 +02:00
Tomasz Grabiec
f965e64a05 serializer: Add serializer<std::map<K, V>> 2016-07-25 17:22:21 +02:00
Tomasz Grabiec
5ffaccfa7d serializer: Add serializer<std::array<T, N>> 2016-07-25 17:20:52 +02:00
Tomasz Grabiec
43a69e64f6 serializer: Add serializer<std::chrono::duration<T, Ratio>> 2016-07-25 17:20:08 +02:00
Tomasz Grabiec
445f763fa3 serializer: Add serializer<std::vector<T>> 2016-07-25 17:20:08 +02:00
Tomasz Grabiec
953fce3f7b serializer: Define serializer<> specializations for integral types 2016-07-25 17:19:50 +02:00
Avi Kivity
c908e630f0 Merge seastar upstream
* seastar 2425aab...64ae228 (1):
  > Merge "Adding prometheus API support" from Amnon
2016-07-25 15:32:40 +03:00
Avi Kivity
45e1274064 Merge "Add GDB commands for working with thread lists" from Tomasz 2016-07-25 15:32:02 +03:00
Tomasz Grabiec
0b8aafd72c scylla-gdb.py: Introduce "scylla thread apply all ..."
Similar to gdb's "thread apply all". Executes given command in the
context of each seastar thread.

For example to print backtrace of all threads:

  scylla thread apply-all bt
2016-07-25 12:40:47 +02:00
Tomasz Grabiec
2d6341d4cb scylla-gdb.py: Introduce "scylla threads" command
Lists all seastar threads.

Example:

(gdb) scylla threads
[shard  1] (seastar::thread_context*) 0x602008c9aa00
[shard  1] (seastar::thread_context*) 0x602008c9ca00
[shard  1] (seastar::thread_context*) 0x602008cf5800
[shard  1] (seastar::thread_context*) 0x602008cbe000
[shard  1] (seastar::thread_context*) 0x602008c4bc00
[shard  2] (seastar::thread_context*) 0x601008d6b800
[shard  2] (seastar::thread_context*) 0x601008d89400
[shard  2] (seastar::thread_context*) 0x601008c95a00
[shard  0] (seastar::thread_context*) 0x600008d84400
[shard  0] (seastar::thread_context*) 0x6000000a1600
[shard  0] (seastar::thread_context*) 0x600008ce9a00
[shard  0] (seastar::thread_context*) 0x600008df9a00
[shard  0] (seastar::thread_context*) 0x600008dfee00
[shard  0] (seastar::thread_context*) 0x600008d85800
[shard  0] (seastar::thread_context*) 0x600008df9000
[shard  0] (seastar::thread_context*) 0x600008d82c00
[shard  0] (seastar::thread_context*) 0x600008cece00
2016-07-25 12:32:48 +02:00
Tomasz Grabiec
0e1b21eab6 scylla-gdb.py: Make intrusive_list() work with newer version of boost 2016-07-25 12:32:48 +02:00
Duarte Nunes
d8a4bd6b1a sstables: Remove duplication in extract_clustering_key
This patch removes some duplicated code in extract_clustering_key(),
which is already handled in composite_view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469397806-8067-1-git-send-email-duarte@scylladb.com>
2016-07-25 10:07:25 +02:00
Duarte Nunes
b2278697ce sstables: Remove superfluous call to check_static()
When building a column we're calling check_static() two times;
refector things a bit so that this doesn't happen and we reuse the
previous calculation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469397748-7987-1-git-send-email-duarte@scylladb.com>
2016-07-25 10:06:56 +02:00
Duarte Nunes
7a81553d17 compound_compat: Only compound values can be static
If a composite is not a compound, then it doesn't carry a length
prefix where static information is encoded. In its absence, a
non-compound composite can never be static.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469397561-7748-1-git-send-email-duarte@scylladb.com>
2016-07-25 10:05:19 +02:00
Avi Kivity
f2939b125e dist: reduce kernel's inclincation to OOM-kill Scylla
Scylla is likely the most important process on the machine, make the kernel
less inclined to kill it, on systemd-enabled hosts.

See #1393.
Message-Id: <1468847986-9429-1-git-send-email-avi@scylladb.com>
2016-07-25 10:00:19 +02:00
Duarte Nunes
5c4a2044d5 thrift: Fail when creating mixed CF
This patch ensures we fail when creating a mixed column family, either
when adding columns to a dynamic CF through updated_column_family() or
when adding a dynamic column upon insertion.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469378658-19853-1-git-send-email-duarte@scylladb.com>
2016-07-25 10:35:37 +03:00
Duarte Nunes
560cc12fd7 thrift: Correctly translate no_such_column_family
The no_such_column_family exception is translated to
InvalidRequestException instead of to NotFoundException.

8991d35231 exposed this problem.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469376674-14603-1-git-send-email-duarte@scylladb.com>
2016-07-25 10:35:07 +03:00
Avi Kivity
df346d8b49 Merge "thrift: Implements describe_splits verbs" from Duarte
"This patchset implements the describe_splits and describe_splits_ex verbs
in Thrift. It also contains a couple of related fixes."
2016-07-25 10:08:00 +03:00
Raphael S. Carvalho
56ef3d4fde sstables: get rid of magic number
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <c0b3acf58aa6f7632bf15a2dd0dcac4d1b45d444.1469389290.git.raphaelsc@scylladb.com>
2016-07-25 10:06:51 +03:00
Takuya ASADA
b5bb702b35 dist: add dependency for lspci
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1469429320-13645-1-git-send-email-syuu@scylladb.com>
2016-07-25 10:02:30 +03:00
Duarte Nunes
ab08561b89 thrift: Implement describe_splits verb
This patch implements the describe_splits verb on top of
describe_splits_ex.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:44:07 +00:00
Duarte Nunes
472c23d7d2 thrift: Implement describe_splits_ex verb
This patch implements the describe_splits_ex verbs by querying the
size_estimates system table for all the estimates in the specified
token range.

If the keys_per_split argument is bigger then the
estimated partitions count, then we merge ranges until keys_per_split
is met. Note that the tokens can't be split any further,
keys_per_split might be less than the reported number of keys in one
or more ranges.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:44:01 +00:00
Duarte Nunes
ecfa04da77 system_keyspace: Add query_size_estimates() function
The query_size_estimates() function queries the size_estimates system
table for a given keyspace and table, filtering out the token ranges
according to the specified tokens.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:43:58 +00:00
Duarte Nunes
d984cc30bf size_estimates_recorder: Fix stop()
This patch fixes stop() by checking if the current CPU instead of
whether the service is active (which it won't be at the time stop() is
called).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:43:58 +00:00
Duarte Nunes
e16f3f2969 system_keyspace: Avoid pointers in range_estimates
This patch makes range_estimates a proper struct, where tokens are
represented as dht::tokens rather than dht::ring_position*.

We also pass other arguments to update_ and clear_size_estimates by
copy, since one will already be required.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:43:35 +00:00
Asias He
2f4cd86809 random_partitioner: Implement random_partitioner
Cassandra 1.x clusters often use RandomPartitioner. Supporting
RandomPartitioner will allow easier migration to Scylla

Tests are added to make sure scylla generates the same token as
Cassandra does for the same partition key.

Fixes #1438

Message-Id: <3bc8b7f06fad16d59aaaa96e2827198ce74214c6.1469166766.git.asias@scylladb.com>
2016-07-24 16:25:25 +03:00
Avi Kivity
a0369a6d5b Merge seastar upstream
* seastar 9d1db3f...2425aab (5):
  > Merge "Track all threads globally" from Tomasz
  > net: provide remote address on native_server_socket_impl<Protocol>::accept()
  > Introduce install-dependencies.sh
  > reactor: make sure a poll cycle always happens when later is called
  > Merge "rpc: various fixes and cleanups" from Gleb
2016-07-24 16:22:05 +03:00
Avi Kivity
b339337287 Merge "Add more tools to the GDB script" from Tomasz 2016-07-24 16:21:31 +03:00
Duarte Nunes
2be45c4806 thrift: Handle and convert invalid_request_exception
This patch converts an exceptions::invalid_request_exception
into a Thrift InvalidRequestException instead of into a generic one.

This makes TitanDB work correctly, which expects an
InvalidRequestException when setting a non-existent keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469362086-1013-1-git-send-email-duarte@scylladb.com>
2016-07-24 14:08:58 +02:00
Duarte Nunes
b5968ac244 sstables: Fix format string
This patch fixes a format string by using {} instead of %s.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469352081-2167-1-git-send-email-duarte@scylladb.com>
2016-07-24 14:01:04 +02:00
Avi Kivity
de62285591 bloom_filter: use correct types
'long' is not a defined size.  It happens to match Java's long on Linux x86_64,
but may not on other platforms (e.g. Windows x64).
Message-Id: <1469352705-1079-1-git-send-email-avi@scylladb.com>
2016-07-24 14:00:37 +02:00
Avi Kivity
900639915d bloom_filter: fix overflow for large filters
We use ::abs(), which has an int parameter, on long arguments, resulting
in incorrect results.

Switch to std::abs() instead, which has the correct overloads.

Fixes #1494.

Message-Id: <1469347802-28933-1-git-send-email-avi@scylladb.com>
2016-07-24 11:31:26 +03:00
Vlad Zolotarov
9423c13419 cql_server::connection::process_prepare(): don't std::move() a shared_ptr captured by reference in value_of() lambda
A seastar::value_of() lambda used in a trace point was doing the unthinkable:
it called std::move() on a value captured by reference. Not only it compiled(!!!)
but it also actually std::move()ed the shared_ptr before it was used in a make_result()
which naturally caused a SIGSEG crash.

Fixes #1491

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1469193763-27631-1-git-send-email-vladz@cloudius-systems.com>
2016-07-22 16:32:21 +03:00
Tomasz Grabiec
0db79c7153 scylla-gdb.py: Introduce "scylla mem-range" command
Prints memory range belonging to current shard.
2016-07-22 15:19:48 +02:00
Tomasz Grabiec
26e66bf49e scylla-gdb.py: Introduce "scylla thread" command
Switches into a seastar thread. "scylla unthread" switches back.

Example:

```
  (gdb) scylla thread 0x6000000e1800
  Switched to thread 1, (seastar::thread_context*) 0x6000000e1800

  (gdb) bt
  #0  seastar::thread_context::switch_out (this=0x6000000e1800) at core/thread.cc:104
  #1  0x00000000004cfb07 in future<>::wait() (this=0x600008ca2c70) at core/future.hh:817
  #2  0x0000000000f7752c in future<>::get() (this=0x600008ca2c70) at /home/tgrabiec/src/scylla2/seastar/core/future.hh:787
  ...
  #16 seastar::thread_context::main (this=0x6000000e1800) at core/thread.cc:166
  #17 0x000000000051a702 in seastar::thread_context::s_main (lo=<optimized out>, hi=<optimized out>) at core/thread.cc:157
  #18 0x00007f2c34861f20 in ?? () from /lib64/libc.so.6
  #19 0x0000000000000000 in ?? ()

  (gdp) scylla unthread
  Switched to thread 1
```
2016-07-22 15:19:48 +02:00
Tomasz Grabiec
d6fc9ad48e scylla-gdb.py: Break down code into finer abstractions 2016-07-22 15:19:48 +02:00
Raphael S. Carvalho
c4f34f5038 compaction: do not convert timestamp resolution to uppercase
C* only allows timestamp resolution in uppercase, so we shouldn't
be forgiving about it, otherwise migration to C* will not work.
Timestamp resolution is stored in compaction strategy options of
schema BTW.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d64878fc9bbcf40fd8de3d0f08cce9f6c2fde717.1469133851.git.raphaelsc@scylladb.com>
2016-07-22 07:03:27 +01:00
Avi Kivity
106e3703d9 sstables: stop using unaligned_cast
unaligned_cast violates strict aliasing, and causes code misgeneration on
gcc 6.  Replace it with read_be/write_be, which are nicer anyway.
Message-Id: <1469122850-7511-1-git-send-email-avi@scylladb.com>
2016-07-22 07:03:08 +01:00
Raphael S. Carvalho
56a50784f8 compaction_manager: make registration of sstables and weight exception safe
Compacting sstables and weight could be left unregistered in event of an
exception. Let's make it safe by using a RAII approach.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <f2cf9d0c12f22046293bd2185ef14ede3f4d63d4.1469114161.git.raphaelsc@scylladb.com>
2016-07-22 07:02:48 +01:00
Vlad Zolotarov
4647ad9d8a tracing: set a default TTL for system_traces tables when they are created
Fixes #1482

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1469104164-4452-1-git-send-email-vladz@cloudius-systems.com>
2016-07-22 07:02:29 +01:00
Vlad Zolotarov
57b58cad8e SELECT tracing instrumentation: improve inter-nodes communication stages messages
Add/fix "sending to"/"received from" messages.

With this patch the single key select trace with a data on an external node
looks as follows:

Tracing session: 65dbfcc0-4f51-11e6-8dd2-000000000001

 activity                                                                                                                        | timestamp                  | source    | source_elapsed
---------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                                                              Execute CQL3 query | 2016-07-21 17:42:50.124000 | 127.0.0.2 |              0
                                                                                                   Parsing a statement [shard 1] | 2016-07-21 17:42:50.124127 | 127.0.0.2 |             --
                                                                                                Processing a statement [shard 1] | 2016-07-21 17:42:50.124190 | 127.0.0.2 |             64
 Creating read executor for token 2309717968349690594 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 1] | 2016-07-21 17:42:50.124229 | 127.0.0.2 |            103
                                                                            read_data: sending a message to /127.0.0.1 [shard 1] | 2016-07-21 17:42:50.124234 | 127.0.0.2 |            108
                                                                           read_data: message received from /127.0.0.2 [shard 1] | 2016-07-21 17:42:50.124358 | 127.0.0.1 |             14
                                                          read_data handling is done, sending a response to /127.0.0.2 [shard 1] | 2016-07-21 17:42:50.124434 | 127.0.0.1 |             89
                                                                               read_data: got response from /127.0.0.1 [shard 1] | 2016-07-21 17:42:50.124662 | 127.0.0.2 |            536
                                                                                  Done processing - preparing a result [shard 1] | 2016-07-21 17:42:50.124695 | 127.0.0.2 |            569
                                                                                                                Request complete | 2016-07-21 17:42:50.124580 | 127.0.0.2 |            580

Fixes #1481

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1469112271-22818-1-git-send-email-vladz@cloudius-systems.com>
2016-07-21 19:46:43 +03:00
Tomasz Grabiec
5e8f0efc85 schema_tables: Fix hang during keyspace drop
Fixes #1484.

We drop tables as part of keyspace drop. Table drop starts with
creating a snapshot on all shards. All shards must use the same
snapshot timestamp which, among other things, is part of the snapshot
name. The timestamp is generated using supplied timestamp generating
function (joinpoint object). The joinpoint object will wait for all
shards to arrive and then generate and return the timestamp.

However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.

The fix is to give each table a separate joinpoint instance.

Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>
2016-07-21 19:14:57 +03:00
Vlad Zolotarov
e1480cd00d tracing: add a word "shard" to a "thread" value
Add a word "shard" to a "thread" column value.
From now its format is "shard X", where X is a shard index.

Fixes #1480

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1469099732-23334-1-git-send-email-vladz@cloudius-systems.com>
2016-07-21 15:08:09 +03:00
Paweł Dziepak
1c7c944eee Merge "Thrift: small cleanups" from Duarte
"This patchset adds several small thrift related cleanups."
2016-07-21 11:56:51 +01:00
Duarte Nunes
8991d35231 thrift: Use database::find_schema directly
This patch changes lookup_schema() so it directly calls
database::find_schema() instead of going through
database::find_column_family(). It also drops conversion of the
no_such_column_family exeption, as that is already handled at a higher
layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-21 10:37:38 +00:00
Duarte Nunes
038d42c589 thrift: Remove hardcoded version constant
...and use the one in thrift_server.hh instead.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-21 10:37:33 +00:00
Duarte Nunes
8bb43d09b1 thrift: Remove unused with_cob_dereference function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-21 10:13:55 +00:00
Paweł Dziepak
8a386a51bd Merge "Don't cache wide partitions" from Piotr
"When reading a partition try to read it all
but once more bytes are read than a given limit
we decide that partition is wide and we don't cache it.
Instead we retry the read with clustering key filtering
applied."
2016-07-21 10:24:25 +01:00
Benoît Canet
4ce7bced27 docker: Add documentation page for Docker Hub
Signed-of-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466438296-5593-1-git-send-email-benoit@scylladb.com>
2016-07-21 12:22:57 +03:00
Yoav Kleinberger
d1d1be4c1a docker: bring docker image closer to a more 'standard' scylla installation
Previously, the Docker image could only be run interactively, which is
not conducive for running clusters. This patch makes the docker image
run in the background (using systemd). This makes the docker workflow
similar to working with virtual machines, i.e. the user launches a
container, and once it is running they can connect to it with

       docker exec -it <container_name> bash

and immediately use `cqlsh` to control it.

In addition, the configuration of scylla is done using established
scripts, such as `scylla_dev_mode_setup`, `scylla_cpuset_setup` and
`scylla_io_setup`, whereas previously code from these scripts was
duplicated into the docker startup file.

To specify seeds for making a cluster, use the --seeds command line
argument, e.g.

    docker run -d --privileged scylladb/scylla
    docker run -d --privileged scylladb/scylla --seeds 172.17.0.2

other options include --developer-mode, --cpuset, --broadcast-address

The --developer-mode option mode is on by default - so that we don't fail users
who just want to play with this.

The Dockerfile entrypoint script was rewritten as a few Python modules.
The move to Python is meritted because:

    * Using `sed` to manipulate YAML is fragile
    * Lack of proper command line parsing resulted in introducing ad-hoc environment variables
    * Shell scripts don't throw exceptions, and it's easy to forget to check exit codes for every single command

I've made an effort to make the entrypoint `go' script very simple and readable.
The goary details are hidden inside the other python modules.

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1468938693-32168-1-git-send-email-yoav@scylladb.com>
2016-07-21 12:20:39 +03:00
Duarte Nunes
a436cf945c thrift: Omit regular columns for dynamic CFs
This patch skips adding the auto-generated regular column when
describing a dynamic Column family for the describe_keyspace(s) verbs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469091720-10113-1-git-send-email-duarte@scylladb.com>
2016-07-21 12:06:10 +03:00
Pekka Enberg
d90f8dc2d9 Merge "tracing: cleanup" from Vlad 2016-07-21 11:57:54 +03:00
Avi Kivity
a9e07b292b Merge seastar upstream
* seastar 103543a...9d1db3f (8):
  > reactor: limit task backlog
  > iotune: Fix SIGFPE with some executions
  > Merge "Preparation for protobuf" from Amnon
  > byteorder: add missing cpu_to_be(), be_to_cpu() functions
  > rpc: fix gcc-7 compilation error
  > reactor: Register the smp metrics disabled
  > scollectd: Allow creating metric that is disabled
  > Merge "Propagate timeout to a server" from Gleb
2016-07-21 10:54:48 +03:00
Piotr Jastrzebski
7d29cdf81f Add tests for wide partiton handling in cache.
They shouldn't be cached.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:48:09 +02:00
Piotr Jastrzebski
37a7d49676 Add collectd counter for uncached wide partitions.
Keep track of every read of wide partition that's
not cached.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:47:49 +02:00
Piotr Jastrzebski
636a4acfd0 Add flag to configure
max size of a cached partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:47:20 +02:00
Piotr Jastrzebski
98c12dc2e2 Try to read whole streamed_mutation up to limit
If limit is exceeded then return the streamed_mutation
and don't cache it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:35:35 +02:00
Piotr Jastrzebski
0d39bb1ad0 Implement mutation_from_streamed_mutation_with_limit
If mutation is bigger than this limit
it won't be read and mutation_from_streamed_mutation
will return empty optional.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:35:23 +02:00
Avi Kivity
5e3b019447 Merge "Fix sstable reader for duplicated range tombstones" from Paweł
"This series fixes sstable reader so that it can handle duplicated range
tombstones which may appear if promoted index is used."
2016-07-21 10:13:29 +03:00
Vlad Zolotarov
276ef041d2 tracing: use "trace" log level instead of "debug".
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-20 19:50:40 +03:00
Vlad Zolotarov
d7d72c4cd4 tracing: "inline" cleanup
- Don't use inline for templates.
   - Put "inline" qualifier for out-of-class defined methods
     where they are defined and not where they are declared.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-20 19:49:30 +03:00
Vlad Zolotarov
5376b053f9 tracing: use seastar::format() for formatted trace()
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-20 19:49:30 +03:00
Duarte Nunes
8d87976fa1 tracing: Downgrade debug level log messages
Two periodic debug log level messages triggered by
tracing::write_timer_callback() quickly fill the logs, so this patch
downgrades them to trace level.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1469003034-1147-1-git-send-email-duarte@scylladb.com>
2016-07-20 11:34:02 +03:00
Pekka Enberg
aff8cf319d db/config: Start Thrift server by default
We have Thrift support now so start the server by default.
Message-Id: <1469002000-26767-1-git-send-email-penberg@scylladb.com>
2016-07-20 09:25:44 +01:00
Avi Kivity
5ceb55827d Merge "Add more tools to the gdb script" from Tomasz 2016-07-20 11:21:58 +03:00
Tomasz Grabiec
d2f0711608 scylla-gdb: Fix bounds checking in scylla ptr command
Message-Id: <1468951987-10184-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 11:20:45 +03:00
Duarte Nunes
64dff69077 thrift: Actually concatenate strings
This patch fixes concatenating a char[] with an int by using sprint
instead of just increasing the pointer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468971542-9600-1-git-send-email-duarte@scylladb.com>
2016-07-20 11:08:44 +03:00
Tomasz Grabiec
0d26294fac database: Add table name to log message about sealing
Message-Id: <1468917744-2539-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 10:12:31 +03:00
Tomasz Grabiec
a0832f08d2 schema_tables: Add more logging
Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 10:12:00 +03:00
Pekka Enberg
7b5c2266b6 Merge "date tiered compaction strategy options" from Raphael
"After this patchset, user can now tune date tiered compaction strategy by
playing with its parameters."
2016-07-20 09:13:08 +03:00
Raphael S. Carvalho
cf54af9e58 tests: add new test for date tiered strategy
This test set the time window to 1 hour and checks that the strategy
works accordingly.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-19 19:20:40 -03:00
Raphael S. Carvalho
eaa6e281a2 compaction: implement date tiered compaction strategy options
Now date tiered compaction strategy will take into account the
strategy options which are defined in the schema.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-19 19:20:35 -03:00
Tomasz Grabiec
93a942dc53 scylla-gdb.py: Add "scylla mem-ranges" command
Prints memory ranges corresponding to seastar heap.

The ouput can be easily turned into arguemnts to "find".
2016-07-19 20:23:40 +02:00
Tomasz Grabiec
724f99641f scylla-gdb.py: Add "scylla shard" command
Beams you up to the thread corresponding to given seastar shard.

Example:

 (gdb) scylla shard 0
 Switched to thread 49
2016-07-19 20:23:40 +02:00
Tomasz Grabiec
d5fb452e94 scylla-gdb.py: Add "scylla apply" command
Executes given command on all shards.
2016-07-19 20:23:40 +02:00
Tomasz Grabiec
720d8149ba scylla-gdb.py: Add "scylla timers" command
Lists all active timers on current shard.
2016-07-19 20:23:40 +02:00
Tomasz Grabiec
6e4506a1f2 scylla-gdb.py: Add std::array wrapper which allows iteration 2016-07-19 20:23:40 +02:00
Tomasz Grabiec
29af5b22da scylla-gdb.py: Import intrusive_list() function
Imported from:

  https://github.com/cloudius-systems/osv/blob/master/scripts/loader.py
2016-07-19 20:23:40 +02:00
Paweł Dziepak
b405ff8ad2 tests/sstables: test reading sstable with duplicated range tombstones
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-19 15:13:27 +01:00
Paweł Dziepak
04f2c278c2 sstables: avoid recursion in sstable_streamed_mutation::read_next()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-19 15:12:43 +01:00
Paweł Dziepak
08032db269 sstables: protect against duplicated range tombstones
Promoted index may cause sstable to have range tombstones duplicated
several times. These duplicates appear in the "wrong" place since they
are smaller than the entity preceeding them.

This patch ignores such duplicates by skipping range tombstones that are
smaller than previously read ones.

Moreover, these duplicted range tombstone may appear in the middle of
clustering row, so the sstable reader has also gained the ability to
merge parts of the row in such cases.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-19 15:07:12 +01:00
Paweł Dziepak
50469e5ef3 tests: extract streamed_mutation assertions
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-19 14:45:36 +01:00
Calle Wilund
14b0fe23c5 commitlog: Ensure we don't end up in a loop when we must wait for alloc
Continuation reordering could cause us to repeatedly see the 
segment-local flag var even though actual write/sync ops are done. 
Can cause wild recursion without actual delayed continuation ->
SOE. 

Fix by also checking queue status, since this is the wait object.
2016-07-11 07:45:36 +00:00
473 changed files with 21119 additions and 8836 deletions

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

11
CONTRIBUTING.md Normal file
View File

@@ -0,0 +1,11 @@
# Asking questions or requesting help
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.
# Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
# Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.

View File

@@ -8,14 +8,14 @@ In addition to required packages by Seastar, the following packages are required
Scylla uses submodules, so make sure you pull the submodules first by doing:
```
git submodule init
git submodule update --recursive
git submodule update --init --recursive
```
### Building and Running Scylla on Fedora
* Installing required packages:
```
sudo yum install yaml-cpp-devel lz4-devel zlib-devel snappy-devel jsoncpp-devel thrift-devel antlr3-tool antlr3-C++-devel libasan libubsan gcc-c++ gnutls-devel ninja-build ragel libaio-devel cryptopp-devel xfsprogs-devel numactl-devel hwloc-devel libpciaccess-devel libxml2-devel python3-pyparsing lksctp-tools-devel
sudo dnf install yaml-cpp-devel lz4-devel zlib-devel snappy-devel jsoncpp-devel thrift-devel antlr3-tool antlr3-C++-devel libasan libubsan gcc-c++ gnutls-devel ninja-build ragel libaio-devel cryptopp-devel xfsprogs-devel numactl-devel hwloc-devel libpciaccess-devel libxml2-devel python3-pyparsing lksctp-tools-devel protobuf-devel protobuf-compiler systemd-devel libunwind-devel
```
* Build Scylla
@@ -83,14 +83,6 @@ Run the image with:
docker run -p $(hostname -i):9042:9042 -i -t <image name>
```
## Contributing to Scylla
Do not send pull requests.
Send patches to the mailing list address scylladb-dev@googlegroups.com.
Be sure to subscribe.
In order for your patches to be merged, you must sign the Contributor's
License Agreement, protecting your rights and ours. See
http://www.scylladb.com/opensource/cla/.
[Guidelines for contributing](CONTRIBUTING.md)

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=1.6.6
if test -f version
then

View File

@@ -397,6 +397,36 @@
}
]
},
{
"path": "/cache_service/metrics/key/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/size",
"operations": [
@@ -607,6 +637,36 @@
}
]
},
{
"path": "/cache_service/metrics/counter/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/size",
"operations": [

View File

@@ -78,11 +78,19 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",
"required":false,
"allowMultiple":false,
"type":"bool",
"paramType":"query"
}
]
}
@@ -102,7 +110,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -129,7 +137,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -153,7 +161,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -180,7 +188,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -204,7 +212,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -244,7 +252,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -271,7 +279,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -298,7 +306,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -317,7 +325,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -349,7 +357,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -381,7 +389,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -405,7 +413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -432,7 +440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -459,7 +467,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -491,7 +499,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -518,7 +526,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -545,7 +553,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -569,7 +577,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -593,7 +601,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -633,7 +641,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -673,7 +681,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -713,7 +721,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -753,7 +761,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -793,7 +801,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -833,7 +841,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -873,7 +881,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -916,7 +924,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -943,7 +951,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -970,7 +978,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -994,7 +1002,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1034,7 +1042,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1058,7 +1066,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1101,7 +1109,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1144,7 +1152,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1203,7 +1211,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1243,7 +1251,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1267,7 +1275,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1310,7 +1318,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1353,7 +1361,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1412,7 +1420,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1452,7 +1460,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1492,7 +1500,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1532,7 +1540,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1572,7 +1580,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1612,7 +1620,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1652,7 +1660,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1692,7 +1700,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1732,7 +1740,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1772,7 +1780,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1812,7 +1820,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1852,7 +1860,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1892,7 +1900,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1932,7 +1940,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1972,7 +1980,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2012,7 +2020,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2052,7 +2060,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2092,7 +2100,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2116,7 +2124,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2156,7 +2164,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2196,7 +2204,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2236,7 +2244,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2276,7 +2284,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2300,7 +2308,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2324,7 +2332,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2351,7 +2359,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2378,7 +2386,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2405,7 +2413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2432,7 +2440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2501,7 +2509,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2525,7 +2533,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2549,7 +2557,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2573,7 +2581,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2597,7 +2605,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2621,7 +2629,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2645,7 +2653,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2669,7 +2677,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2693,7 +2701,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2717,7 +2725,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2741,7 +2749,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2765,7 +2773,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",

View File

@@ -21,8 +21,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
@@ -45,8 +45,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"

View File

@@ -42,6 +42,25 @@
}
]
},
{
"path":"/failure_detector/endpoint_phi_values",
"operations":[
{
"method":"GET",
"summary":"Get end point phi values",
"type":"array",
"items":{
"type":"endpoint_phi_values"
},
"nickname":"get_endpoint_phi_values",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/failure_detector/endpoints/",
"operations":[
@@ -202,6 +221,20 @@
"description": "The application state version"
}
}
},
"endpoint_phi_value": {
"id" : "endpoint_phi_value",
"description": "Holds phi value for a single end point",
"properties": {
"phi": {
"type": "double",
"description": "Phi value"
},
"endpoint": {
"type": "string",
"description": "end point address"
}
}
}
}
}

View File

@@ -777,7 +777,7 @@
]
},
{
"path": "/storage_proxy/metrics/read/moving_avrage_histogram",
"path": "/storage_proxy/metrics/read/moving_average_histogram",
"operations": [
{
"method": "GET",
@@ -792,7 +792,7 @@
]
},
{
"path": "/storage_proxy/metrics/range/moving_avrage_histogram",
"path": "/storage_proxy/metrics/range/moving_average_histogram",
"operations": [
{
"method": "GET",
@@ -942,7 +942,7 @@
]
},
{
"path": "/storage_proxy/metrics/write/moving_avrage_histogram",
"path": "/storage_proxy/metrics/write/moving_average_histogram",
"operations": [
{
"method": "GET",

View File

@@ -1201,11 +1201,12 @@
],
"parameters":[
{
"name":"non_system",
"description":"When set to true limit to non system",
"name":"type",
"description":"Which keyspaces to return",
"required":false,
"allowMultiple":false,
"type":"boolean",
"type":"string",
"enum": [ "all", "user", "non_local_strategy" ],
"paramType":"query"
}
]
@@ -1736,6 +1737,57 @@
}
]
},
{
"path":"/storage_service/slow_query",
"operations":[
{
"method":"POST",
"summary":"Set slow query parameter",
"type":"void",
"nickname":"set_slow_query",
"produces":[
"application/json"
],
"parameters":[
{
"name":"enable",
"description":"set it to true to enable, anything else to disable",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"ttl",
"description":"TTL in seconds",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"threshold",
"description":"Slow query record threshold in microseconds",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
},
{
"method":"GET",
"summary":"Returns the slow query record configuration.",
"type":"slow_query_info",
"nickname":"get_slow_query_info",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_service/auto_compaction/{keyspace}",
"operations":[
@@ -2133,6 +2185,24 @@
}
}
},
"slow_query_info": {
"id":"slow_query_info",
"description":"Slow query triggering information",
"properties":{
"enable":{
"type":"boolean",
"description":"Is slow query logging enable or disable"
},
"ttl":{
"type":"long",
"description":"The slow query TTL in seconds"
},
"threshold":{
"type":"long",
"description":"The slow query logging threshold in microseconds. Queries that takes longer, will be logged"
}
}
},
"endpoint_detail":{
"id":"endpoint_detail",
"description":"Endpoint detail",

View File

@@ -116,6 +116,7 @@ inline
httpd::utils_json::histogram to_json(const utils::ihistogram& val) {
httpd::utils_json::histogram h;
h = val;
h.sum = val.estimated_sum();
return h;
}
@@ -129,7 +130,7 @@ httpd::utils_json::rate_moving_average meter_to_json(const utils::rate_moving_av
inline
httpd::utils_json::rate_moving_average_and_histogram timer_to_json(const utils::rate_moving_average_and_histogram& val) {
httpd::utils_json::rate_moving_average_and_histogram h;
h.hist = val.hist;
h.hist = to_json(val.hist);
h.meter = meter_to_json(val.rate);
return h;
}
@@ -165,33 +166,36 @@ inline int64_t max_int64(int64_t a, int64_t b) {
* It combine total and the sub set for the ratio and its
* to_json method return the ration sub/total
*/
struct ratio_holder : public json::jsonable {
double total = 0;
double sub = 0;
template<typename T>
struct basic_ratio_holder : public json::jsonable {
T total = 0;
T sub = 0;
virtual std::string to_json() const {
if (total == 0) {
return "0";
}
return std::to_string(sub/total);
}
ratio_holder() = default;
ratio_holder& add(double _total, double _sub) {
basic_ratio_holder() = default;
basic_ratio_holder& add(T _total, T _sub) {
total += _total;
sub += _sub;
return *this;
}
ratio_holder(double _total, double _sub) {
basic_ratio_holder(T _total, T _sub) {
total = _total;
sub = _sub;
}
ratio_holder& operator+=(const ratio_holder& a) {
basic_ratio_holder<T>& operator+=(const basic_ratio_holder<T>& a) {
return add(a.total, a.sub);
}
friend ratio_holder operator+(ratio_holder a, const ratio_holder& b) {
friend basic_ratio_holder<T> operator+(basic_ratio_holder a, const basic_ratio_holder<T>& b) {
return a += b;
}
};
typedef basic_ratio_holder<double> ratio_holder;
typedef basic_ratio_holder<int64_t> integral_ratio_holder;
class unimplemented_exception : public base_exception {
public:

View File

@@ -177,6 +177,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_key_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME
@@ -194,7 +208,7 @@ void set_cache_service(http_context& ctx, routes& r) {
});
cs::get_row_capacity.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {
return cf.get_row_cache().get_cache_tracker().region().occupancy().used_space();
}, std::plus<uint64_t>());
});
@@ -280,6 +294,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_counter_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME

View File

@@ -24,7 +24,7 @@
#include <vector>
#include "http/exception.hh"
#include "sstables/sstables.hh"
#include "sstables/estimated_histogram.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
namespace api {
@@ -191,8 +191,8 @@ static double update_ratio(double acc, double f, double total) {
return acc;
}
static ratio_holder mean_row_size(column_family& cf) {
ratio_holder res;
static integral_ratio_holder mean_row_size(column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
auto c = i->get_stats_metadata().estimated_row_size.count();
res.sub += i->get_stats_metadata().estimated_row_size.mean() * c;
@@ -219,8 +219,9 @@ static future<json::json_return_type> sum_sstable(http_context& ctx, const sstr
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([uuid, total](database& db) {
std::unordered_map<sstring, uint64_t> m;
for (auto t :*((total) ? db.find_column_family(uuid).get_sstables_including_compacted_undeleted() :
db.find_column_family(uuid).get_sstables()).get()) {
auto sstables = (total) ? db.find_column_family(uuid).get_sstables_including_compacted_undeleted() :
db.find_column_family(uuid).get_sstables();
for (auto t : *sstables) {
m[t->get_filename()] = t->bytes_on_disk();
}
return m;
@@ -234,8 +235,9 @@ static future<json::json_return_type> sum_sstable(http_context& ctx, const sstr
static future<json::json_return_type> sum_sstable(http_context& ctx, bool total) {
return map_reduce_cf_raw(ctx, std::unordered_map<sstring, uint64_t>(), [total](column_family& cf) {
std::unordered_map<sstring, uint64_t> m;
for (auto t :*((total) ? cf.get_sstables_including_compacted_undeleted() :
cf.get_sstables()).get()) {
auto sstables = (total) ? cf.get_sstables_including_compacted_undeleted() :
cf.get_sstables();
for (auto t : *sstables) {
m[t->get_filename()] = t->bytes_on_disk();
}
return m;
@@ -273,6 +275,14 @@ static double get_compression_ratio(column_family& cf) {
return std::move(result).get();
}
static std::vector<uint64_t> concat_sstable_count_per_level(std::vector<uint64_t> a, std::vector<uint64_t>&& b) {
a.resize(std::max(a.size(), b.size()), 0UL);
for (auto i = 0U; i < b.size(); i++) {
a[i] += b[i];
}
return a;
}
void set_column_family(http_context& ctx, routes& r) {
cf::get_column_family_name.set(r, [&ctx] (const_req req){
vector<sstring> res;
@@ -393,14 +403,14 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {
sstables::estimated_histogram res(0);
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
res.merge(i->get_stats_metadata().estimated_row_size);
}
return res;
},
sstables::merge, utils_json::estimated_histogram());
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {
@@ -415,14 +425,14 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {
sstables::estimated_histogram res(0);
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
res.merge(i->get_stats_metadata().estimated_column_count);
}
return res;
},
sstables::merge, utils_json::estimated_histogram());
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::get_all_compression_ratio.set(r, [] (std::unique_ptr<request> req) {
@@ -552,11 +562,13 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, req->param["name"], integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_all_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
@@ -797,10 +809,10 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_sstables_per_read_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return cf.get_stats().estimated_sstable_per_read;
},
sstables::merge, utils_json::estimated_histogram());
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::get_tombstone_scanned_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
@@ -859,17 +871,17 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return cf.get_stats().estimated_read;
},
sstables::merge, utils_json::estimated_histogram());
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return cf.get_stats().estimated_write;
},
sstables::merge, utils_json::estimated_histogram());
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<request> req) {
@@ -898,12 +910,11 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_sstable_count_per_level.set(r, [&ctx](std::unique_ptr<request> req) {
// TBD
// FIXME
// This is a workaround, until there will be an API to return the count
// per level, we return an empty array
vector<uint64_t> res;
return make_ready_future<json::json_return_type>(res);
return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const column_family& cf) {
return cf.sstable_count_per_level();
}, concat_sstable_count_per_level).then([](const std::vector<uint64_t>& res) {
return make_ready_future<json::json_return_type>(res);
});
});
}
}

View File

@@ -22,6 +22,7 @@
#include "compaction_manager.hh"
#include "api/api-doc/compaction_manager.json.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"
namespace api {
@@ -78,7 +79,9 @@ void set_compaction_manager(http_context& ctx, routes& r) {
});
cm::get_pending_tasks.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cm_stats(ctx, &compaction_manager::stats::pending_tasks);
return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf);
}, std::plus<int64_t>());
});
cm::get_completed_tasks.set(r, [&ctx] (std::unique_ptr<request> req) {

View File

@@ -22,16 +22,22 @@
#include "locator/snitch_base.hh"
#include "endpoint_snitch.hh"
#include "api/api-doc/endpoint_snitch_info.json.hh"
#include "utils/fb_utilities.hh"
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r) {
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(req.get_query_param("host"));
static auto host_or_broadcast = [](const_req req) {
auto host = req.get_query_param("host");
return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);
};
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_rack.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(req.get_query_param("host"));
httpd::endpoint_snitch_info_json::get_rack.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {

View File

@@ -88,6 +88,20 @@ void set_failure_detector(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(state);
});
});
fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {
return gms::get_arrival_samples().then([](std::map<gms::inet_address, gms::arrival_window> map) {
std::vector<fd::endpoint_phi_value> res;
auto now = gms::arrival_window::clk::now();
for (auto& p : map) {
fd::endpoint_phi_value val;
val.endpoint = p.first.to_sstring();
val.phi = p.second.phi(now);
res.emplace_back(std::move(val));
}
return make_ready_future<json::json_return_type>(res);
});
});
}
}

View File

@@ -52,9 +52,9 @@ static future<json::json_return_type> sum_timed_rate_as_long(distributed<proxy>
});
}
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, sstables::estimated_histogram proxy::stats::*f) {
return ctx.sp.map_reduce0([f](const proxy& p) {return p.get_stats().*f;}, sstables::estimated_histogram(),
sstables::merge).then([](const sstables::estimated_histogram& val) {
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::estimated_histogram proxy::stats::*f) {
return ctx.sp.map_reduce0([f](const proxy& p) {return p.get_stats().*f;}, utils::estimated_histogram(),
utils::estimated_histogram_merge).then([](const utils::estimated_histogram& val) {
utils_json::estimated_histogram res;
res = val;
return make_ready_future<json::json_return_type>(res);

View File

@@ -22,6 +22,8 @@
#include "storage_service.hh"
#include "api/api-doc/storage_service.json.hh"
#include "db/config.hh"
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <service/storage_service.hh>
#include <db/commitlog/commitlog.hh>
#include <gms/gossiper.hh>
@@ -457,8 +459,15 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_keyspaces.set(r, [&ctx](const_req req) {
auto non_system = req.get_query_param("non_system");
return map_keys(ctx.db.local().keyspaces());
auto type = req.get_query_param("type");
if (type == "user") {
return ctx.db.local().get_non_system_keyspaces();
} else if (type == "non_local_strategy") {
return map_keys(ctx.db.local().get_keyspaces() | boost::adaptors::filtered([](const auto& p) {
return p.second.get_replication_strategy().get_type() != locator::replication_strategy_type::local;
}));
}
return map_keys(ctx.db.local().get_keyspaces());
});
ss::update_snitch.set(r, [](std::unique_ptr<request> req) {
@@ -681,6 +690,37 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(tracing::tracing::get_local_tracing_instance().get_trace_probability());
});
ss::get_slow_query_info.set(r, [](const_req req) {
ss::slow_query_info res;
res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();
res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;
res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();
return res;
});
ss::set_slow_query.set(r, [](std::unique_ptr<request> req) {
auto enable = req->get_query_param("enable");
auto ttl = req->get_query_param("ttl");
auto threshold = req->get_query_param("threshold");
try {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold] (auto& local_tracing) {
if (threshold != "") {
local_tracing.set_slow_query_threshold(std::chrono::microseconds(std::stol(threshold.c_str())));
}
if (ttl != "") {
local_tracing.set_slow_query_record_ttl(std::chrono::seconds(std::stol(ttl.c_str())));
}
if (enable != "") {
local_tracing.set_slow_query_enabled(strcasecmp(enable.c_str(), "true") == 0);
}
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
} catch (...) {
throw httpd::bad_param_exception(sprint("Bad format value: "));
}
});
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
//TBD
unimplemented();

View File

@@ -236,11 +236,19 @@ public:
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value) {
return atomic_cell_type::make_live(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {
return make_live(timestamp, bytes_view(value));
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
return make_live(timestamp, bytes_view(value), expiry, ttl);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value, ttl_opt ttl) {
if (!ttl) {
return atomic_cell_type::make_live(timestamp, value);

View File

@@ -63,8 +63,8 @@ public:
::feed_hash(as_collection_mutation(), h, def.type);
}
}
size_t memory_usage() const {
return _data.memory_usage();
size_t external_memory_usage() const {
return _data.external_memory_usage();
}
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};

View File

@@ -243,7 +243,8 @@ future<> auth::auth::setup() {
std::map<sstring, sstring> opts;
opts["replication_factor"] = "1";
auto ksm = keyspace_metadata::new_keyspace(AUTH_KS, "org.apache.cassandra.locator.SimpleStrategy", opts, true);
f = service::get_local_migration_manager().announce_new_keyspace(ksm, false);
// We use min_timestamp so that default keyspace metadata will loose with any manual adjustments. See issue #2129.
f = service::get_local_migration_manager().announce_new_keyspace(ksm, api::min_timestamp, false);
}
return f.then([] {
@@ -353,7 +354,7 @@ future<> auth::auth::setup_table(const sstring& name, const sstring& cql) {
parsed->prepare_keyspace(AUTH_KS);
::shared_ptr<cql3::statements::create_table_statement> statement =
static_pointer_cast<cql3::statements::create_table_statement>(
parsed->prepare(db)->statement);
parsed->prepare(db, qp.get_cql_stats())->statement);
auto schema = statement->get_cf_meta_data();
auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());

View File

@@ -47,11 +47,8 @@
const sstring auth::data_resource::ROOT_NAME("data");
auth::data_resource::data_resource(level l, const sstring& ks, const sstring& cf)
: _ks(ks), _cf(cf)
: _level(l), _ks(ks), _cf(cf)
{
if (l != get_level()) {
throw std::invalid_argument("level/keyspace/column mismatch");
}
}
auth::data_resource::data_resource()
@@ -67,14 +64,7 @@ auth::data_resource::data_resource(const sstring& ks, const sstring& cf)
{}
auth::data_resource::level auth::data_resource::get_level() const {
if (!_cf.empty()) {
assert(!_ks.empty());
return level::COLUMN_FAMILY;
}
if (!_ks.empty()) {
return level::KEYSPACE;
}
return level::ROOT;
return _level;
}
auth::data_resource auth::data_resource::from_name(

View File

@@ -56,6 +56,7 @@ private:
static const sstring ROOT_NAME;
level _level;
sstring _ks;
sstring _cf;

View File

@@ -218,12 +218,12 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
// obsolete prepared statements pretty quickly.
// Rely on query processing caching statements instead, and lets assume
// that a map lookup string->statement is not gonna kill us much.
auto& qp = cql3::get_local_query_processor();
return qp.process(
sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), { username }, true).then_wrapped(
[=](future<::shared_ptr<cql3::untyped_result_set>> f) {
return futurize_apply([this, username, password] {
auto& qp = cql3::get_local_query_processor();
return qp.process(sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), {username}, true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
@@ -234,6 +234,8 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));
}
});
}

View File

@@ -40,6 +40,7 @@
*/
#include <unordered_map>
#include <boost/algorithm/string.hpp>
#include "permission.hh"
const auth::permission_set auth::permissions::ALL_DATA =
@@ -75,7 +76,9 @@ const sstring& auth::permissions::to_string(permission p) {
}
auth::permission auth::permissions::from_string(const sstring& s) {
return permission_names.at(s);
sstring upper(s);
boost::to_upper(upper);
return permission_names.at(upper);
}
std::unordered_set<sstring> auth::permissions::to_strings(const permission_set& set) {

View File

@@ -38,6 +38,7 @@ class bytes_ostream {
public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
static constexpr size_type max_chunk_size() { return 16 * 1024; }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -58,7 +59,6 @@ private:
};
// FIXME: consider increasing chunk size as the buffer grows
static constexpr size_type chunk_size{512};
static constexpr size_type usable_chunk_size{chunk_size - sizeof(chunk)};
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
@@ -99,6 +99,19 @@ private:
}
return _current->size - _current->offset;
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be at least chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
return std::max<size_type>(next_size, data_size + sizeof(chunk));
}
// Makes room for a contiguous region of given size.
// The region is accounted for as already written.
// size must not be zero.
@@ -109,7 +122,7 @@ private:
_size += size;
return ret;
} else {
auto alloc_size = size <= usable_chunk_size ? chunk_size : (size + sizeof(chunk));
auto alloc_size = next_alloc_size(size);
auto space = malloc(alloc_size);
if (!space) {
throw std::bad_alloc();
@@ -153,19 +166,18 @@ public:
}
bytes_ostream& operator=(const bytes_ostream& o) {
_size = 0;
_current = nullptr;
_begin = {};
append(o);
if (this != &o) {
auto x = bytes_ostream(o);
*this = std::move(x);
}
return *this;
}
bytes_ostream& operator=(bytes_ostream&& o) noexcept {
_size = o._size;
_begin = std::move(o._begin);
_current = o._current;
o._current = nullptr;
o._size = 0;
if (this != &o) {
this->~bytes_ostream();
new (this) bytes_ostream(std::move(o));
}
return *this;
}
@@ -174,7 +186,7 @@ public:
value_type* ptr;
// makes the place_holder looks like a stream
seastar::simple_output_stream get_stream() {
return seastar::simple_output_stream{reinterpret_cast<char*>(ptr)};
return seastar::simple_output_stream(reinterpret_cast<char*>(ptr), sizeof(T));
}
};
@@ -195,19 +207,19 @@ public:
if (v.empty()) {
return;
}
auto space_left = current_space_left();
if (v.size() <= space_left) {
memcpy(_current->data + _current->offset, v.begin(), v.size());
_current->offset += v.size();
_size += v.size();
} else {
if (space_left) {
memcpy(_current->data + _current->offset, v.begin(), space_left);
_current->offset += space_left;
_size += space_left;
v.remove_prefix(space_left);
}
memcpy(alloc(v.size()), v.begin(), v.size());
auto this_size = std::min(v.size(), size_t(current_space_left()));
if (this_size) {
memcpy(_current->data + _current->offset, v.begin(), this_size);
_current->offset += this_size;
_size += this_size;
v.remove_prefix(this_size);
}
while (!v.empty()) {
auto this_size = std::min(v.size(), size_t(max_chunk_size()));
std::copy_n(v.begin(), this_size, alloc(this_size));
v.remove_prefix(this_size);
}
}
@@ -272,13 +284,8 @@ public:
}
void append(const bytes_ostream& o) {
if (o.size() > 0) {
auto dst = alloc(o.size());
auto r = o._begin.get();
while (r) {
dst = std::copy_n(r->data, r->offset, dst);
r = r->next.get();
}
for (auto&& bv : o.fragments()) {
write(bv);
}
}
@@ -328,6 +335,45 @@ public:
_current->next = nullptr;
_current->offset = pos._offset;
}
void reduce_chunk_count() {
// FIXME: This is a simplified version. It linearizes the whole buffer
// if its size is below max_chunk_size. We probably could also gain
// some read performance by doing "real" reduction, i.e. merging
// all chunks until all but the last one is max_chunk_size.
if (size() < max_chunk_size()) {
linearize();
}
}
bool operator==(const bytes_ostream& other) const {
auto as = fragments().begin();
auto as_end = fragments().end();
auto bs = other.fragments().begin();
auto bs_end = other.fragments().end();
auto a = *as++;
auto b = *bs++;
while (!a.empty() || !b.empty()) {
auto now = std::min(a.size(), b.size());
if (!std::equal(a.begin(), a.begin() + now, b.begin(), b.begin() + now)) {
return false;
}
a.remove_prefix(now);
if (a.empty() && as != as_end) {
a = *as++;
}
b.remove_prefix(now);
if (b.empty() && bs != bs_end) {
b = *bs++;
}
}
return true;
}
bool operator!=(const bytes_ostream& other) const {
return !(*this == other);
}
};
template<>

View File

@@ -44,7 +44,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
mutation_partition_serializer part_ser(*m.schema(), m.partition());
bytes_ostream out;
ser::writer_of_canonical_mutation wr(out);
ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())

View File

@@ -27,121 +27,125 @@
class checked_file_impl : public file_impl {
public:
checked_file_impl(disk_error_signal_type& s, file f)
: _signal(s) , _file(f) {}
checked_file_impl(const io_error_handler& error_handler, file f)
: _error_handler(error_handler), _file(f) {
_memory_dma_alignment = f.memory_dma_alignment();
_disk_read_dma_alignment = f.disk_read_dma_alignment();
_disk_write_dma_alignment = f.disk_write_dma_alignment();
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, iov, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, iov, pc);
});
}
virtual future<> flush(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->flush();
});
}
virtual future<struct stat> stat(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->stat();
});
}
virtual future<> truncate(uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->truncate(length);
});
}
virtual future<> discard(uint64_t offset, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->discard(offset, length);
});
}
virtual future<> allocate(uint64_t position, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->allocate(position, length);
});
}
virtual future<uint64_t> size(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->size();
});
}
virtual future<> close() override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->close();
});
}
virtual subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->list_directory(next);
});
}
private:
disk_error_signal_type &_signal;
const io_error_handler& _error_handler;
file _file;
};
inline file make_checked_file(disk_error_signal_type& signal, file& f)
inline file make_checked_file(const io_error_handler& error_handler, file& f)
{
return file(::make_shared<checked_file_impl>(signal, f));
return file(::make_shared<checked_file_impl>(error_handler, f));
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags,
file_open_options options)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags, options).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_directory(disk_error_signal_type& signal,
inline open_checked_directory(const io_error_handler& error_handler,
sstring name)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return engine().open_directory(name).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}

View File

@@ -0,0 +1,127 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "keys.hh"
#include "schema.hh"
#include "range.hh"
/**
* Represents the kind of bound in a range tombstone.
*/
enum class bound_kind : uint8_t {
excl_end = 0,
incl_start = 1,
// values 2 to 5 are reserved for forward Origin compatibility
incl_end = 6,
excl_start = 7,
};
std::ostream& operator<<(std::ostream& out, const bound_kind k);
bound_kind invert_kind(bound_kind k);
int32_t weight(bound_kind k);
static inline bound_kind flip_bound_kind(bound_kind bk)
{
switch (bk) {
case bound_kind::excl_end: return bound_kind::excl_start;
case bound_kind::incl_end: return bound_kind::incl_start;
case bound_kind::excl_start: return bound_kind::excl_end;
case bound_kind::incl_start: return bound_kind::incl_end;
}
abort();
}
class bound_view {
const static thread_local clustering_key empty_prefix;
public:
const clustering_key_prefix& prefix;
bound_kind kind;
bound_view(const clustering_key_prefix& prefix, bound_kind kind)
: prefix(prefix)
, kind(kind)
{ }
struct compare {
// To make it assignable and to avoid taking a schema_ptr, we
// wrap the schema reference.
std::reference_wrapper<const schema> _s;
compare(const schema& s) : _s(s)
{ }
bool operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
auto type = _s.get().clustering_key_prefix_type();
auto res = prefix_equality_tri_compare(type->types().begin(),
type->begin(p1), type->end(p1),
type->begin(p2), type->end(p2),
tri_compare);
if (res) {
return res < 0;
}
auto d1 = p1.size(_s);
auto d2 = p2.size(_s);
if (d1 == d2) {
return w1 < w2;
}
return d1 < d2 ? w1 <= 0 : w2 > 0;
}
bool operator()(const bound_view b, const clustering_key_prefix& p) const {
return operator()(b.prefix, weight(b.kind), p, 0);
}
bool operator()(const clustering_key_prefix& p, const bound_view b) const {
return operator()(p, 0, b.prefix, weight(b.kind));
}
bool operator()(const bound_view b1, const bound_view b2) const {
return operator()(b1.prefix, weight(b1.kind), b2.prefix, weight(b2.kind));
}
};
bool equal(const schema& s, const bound_view other) const {
return kind == other.kind && prefix.equal(s, other.prefix);
}
bool adjacent(const schema& s, const bound_view other) const {
return invert_kind(other.kind) == kind && prefix.equal(s, other.prefix);
}
static bound_view bottom() {
return {empty_prefix, bound_kind::incl_start};
}
static bound_view top() {
return {empty_prefix, bound_kind::incl_end};
}
/*
template<template<typename> typename T, typename U>
concept bool Range() {
return requires (T<U> range) {
{ range.start() } -> stdx::optional<U>;
{ range.end() } -> stdx::optional<U>;
};
};*/
template<template<typename> typename Range>
static std::pair<bound_view, bound_view> from_range(const Range<clustering_key_prefix>& range) {
return {
range.start() ? bound_view(range.start()->value(), range.start()->is_inclusive() ? bound_kind::incl_start : bound_kind::excl_start) : bottom(),
range.end() ? bound_view(range.end()->value(), range.end()->is_inclusive() ? bound_kind::incl_end : bound_kind::excl_end) : top(),
};
}
friend std::ostream& operator<<(std::ostream& out, const bound_view& b) {
return out << "{bound: prefix=" << b.prefix << ", kind=" << b.kind << "}";
}
};

View File

@@ -1,130 +0,0 @@
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "clustering_key_filter.hh"
#include "keys.hh"
#include "query-request.hh"
#include "range.hh"
namespace query {
const std::vector<range<clustering_key_prefix>>&
clustering_key_filtering_context::get_ranges(const partition_key& key) const {
static thread_local std::vector<range<clustering_key_prefix>> full_range = {{}};
return _factory ? _factory->get_ranges(key) : full_range;
}
clustering_key_filtering_context clustering_key_filtering_context::create_no_filtering() {
return clustering_key_filtering_context{};
}
const clustering_key_filtering_context no_clustering_key_filtering =
clustering_key_filtering_context::create_no_filtering();
class stateless_clustering_key_filter_factory : public clustering_key_filter_factory {
clustering_key_filter _filter;
std::vector<range<clustering_key_prefix>> _ranges;
public:
stateless_clustering_key_filter_factory(std::vector<range<clustering_key_prefix>>&& ranges,
clustering_key_filter&& filter)
: _filter(std::move(filter)), _ranges(std::move(ranges)) {}
virtual clustering_key_filter get_filter(const partition_key& key) override {
return _filter;
}
virtual clustering_key_filter get_filter_for_sorted(const partition_key& key) override {
return _filter;
}
virtual const std::vector<range<clustering_key_prefix>>& get_ranges(const partition_key& key) override {
return _ranges;
}
};
class partition_slice_clustering_key_filter_factory : public clustering_key_filter_factory {
schema_ptr _schema;
const partition_slice& _slice;
clustering_key_prefix::prefix_equal_tri_compare _cmp;
query::clustering_row_ranges _ck_ranges;
public:
partition_slice_clustering_key_filter_factory(schema_ptr s, const partition_slice& slice)
: _schema(std::move(s)), _slice(slice), _cmp(*_schema) {}
virtual clustering_key_filter get_filter(const partition_key& key) override {
const clustering_row_ranges& ranges = _slice.row_ranges(*_schema, key);
return [this, &ranges] (const clustering_key& key) {
return std::any_of(std::begin(ranges), std::end(ranges),
[this, &key] (const range<clustering_key_prefix>& r) { return r.contains(key, _cmp); });
};
}
virtual clustering_key_filter get_filter_for_sorted(const partition_key& key) override {
const clustering_row_ranges& ranges = _slice.row_ranges(*_schema, key);
return [this, &ranges] (const clustering_key& key) {
return std::any_of(std::begin(ranges), std::end(ranges),
[this, &key] (const range<clustering_key_prefix>& r) { return r.contains(key, _cmp); });
};
}
virtual const std::vector<range<clustering_key_prefix>>& get_ranges(const partition_key& key) override {
if (_slice.options.contains(query::partition_slice::option::reversed)) {
_ck_ranges = _slice.row_ranges(*_schema, key);
std::reverse(_ck_ranges.begin(), _ck_ranges.end());
return _ck_ranges;
}
return _slice.row_ranges(*_schema, key);
}
};
static const shared_ptr<clustering_key_filter_factory>
create_partition_slice_filter(schema_ptr s, const partition_slice& slice) {
return ::make_shared<partition_slice_clustering_key_filter_factory>(std::move(s), slice);
}
const clustering_key_filtering_context
clustering_key_filtering_context::create(schema_ptr schema, const partition_slice& slice) {
static thread_local clustering_key_filtering_context accept_all = clustering_key_filtering_context(
::make_shared<stateless_clustering_key_filter_factory>(std::vector<range<clustering_key_prefix>>{{}},
[](const clustering_key&) { return true; }));
static thread_local clustering_key_filtering_context reject_all = clustering_key_filtering_context(
::make_shared<stateless_clustering_key_filter_factory>(std::vector<range<clustering_key_prefix>>{},
[](const clustering_key&) { return false; }));
if (slice.get_specific_ranges()) {
return clustering_key_filtering_context(create_partition_slice_filter(schema, slice));
}
const clustering_row_ranges& ranges = slice.default_row_ranges();
if (ranges.empty()) {
return reject_all;
}
if (ranges.size() == 1 && ranges[0].is_full()) {
return accept_all;
}
return clustering_key_filtering_context(create_partition_slice_filter(schema, slice));
}
}

View File

@@ -22,54 +22,46 @@
*/
#pragma once
#include <functional>
#include <vector>
#include "core/shared_ptr.hh"
#include "database_fwd.hh"
#include "schema.hh"
template<typename T> class range;
#include "query-request.hh"
namespace query {
class partition_slice;
// A predicate that tells if a clustering key should be accepted.
using clustering_key_filter = std::function<bool(const clustering_key&)>;
// A factory for clustering key filter which can be reused for multiple clustering keys.
class clustering_key_filter_factory {
class clustering_key_filter_ranges {
clustering_row_ranges _storage;
const clustering_row_ranges& _ref;
public:
// Create a clustering key filter that can be used for multiple clustering keys with no restrictions.
virtual clustering_key_filter get_filter(const partition_key&) = 0;
// Create a clustering key filter that can be used for multiple clustering keys but they have to be sorted.
virtual clustering_key_filter get_filter_for_sorted(const partition_key&) = 0;
virtual const std::vector<range<clustering_key_prefix>>& get_ranges(const partition_key&) = 0;
virtual ~clustering_key_filter_factory() = default;
};
clustering_key_filter_ranges(const clustering_row_ranges& ranges) : _ref(ranges) { }
struct reversed { };
clustering_key_filter_ranges(reversed, const clustering_row_ranges& ranges)
: _storage(ranges.rbegin(), ranges.rend()), _ref(_storage) { }
class clustering_key_filtering_context {
private:
shared_ptr<clustering_key_filter_factory> _factory;
clustering_key_filtering_context() {};
clustering_key_filtering_context(shared_ptr<clustering_key_filter_factory> factory) : _factory(factory) {}
public:
// Create a clustering key filter that can be used for multiple clustering keys with no restrictions.
clustering_key_filter get_filter(const partition_key& key) const {
return _factory ? _factory->get_filter(key) : [] (const clustering_key&) { return true; };
clustering_key_filter_ranges(clustering_key_filter_ranges&& other) noexcept
: _storage(std::move(other._storage))
, _ref(&other._ref == &other._storage ? _storage : other._ref)
{ }
clustering_key_filter_ranges& operator=(clustering_key_filter_ranges&& other) noexcept {
if (this != &other) {
this->~clustering_key_filter_ranges();
new (this) clustering_key_filter_ranges(std::move(other));
}
return *this;
}
// Create a clustering key filter that can be used for multiple clustering keys but they have to be sorted.
clustering_key_filter get_filter_for_sorted(const partition_key& key) const {
return _factory ? _factory->get_filter_for_sorted(key) : [] (const clustering_key&) { return true; };
auto begin() const { return _ref.begin(); }
auto end() const { return _ref.end(); }
bool empty() const { return _ref.empty(); }
size_t size() const { return _ref.size(); }
static clustering_key_filter_ranges get_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {
const query::clustering_row_ranges& ranges = slice.row_ranges(schema, key);
if (slice.options.contains(query::partition_slice::option::reversed)) {
return clustering_key_filter_ranges(clustering_key_filter_ranges::reversed{}, ranges);
}
return clustering_key_filter_ranges(ranges);
}
const std::vector<range<clustering_key_prefix>>& get_ranges(const partition_key& key) const;
static const clustering_key_filtering_context create(schema_ptr, const partition_slice&);
static clustering_key_filtering_context create_no_filtering();
};
extern const clustering_key_filtering_context no_clustering_key_filtering;
}

View File

@@ -54,9 +54,16 @@ public:
// Return a list of sstables to be compacted after applying the strategy.
compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<lw_shared_ptr<sstable>> candidates);
// Some strategies may look at the compacted and resulting sstables to
// get some useful information for subsequent compactions.
void notify_completion(const std::vector<lw_shared_ptr<sstable>>& removed, const std::vector<lw_shared_ptr<sstable>>& added);
// Return if parallel compaction is allowed by strategy.
bool parallel_compaction() const;
// Return if optimization to rule out sstables based on clustering key filter should be applied.
bool use_clustering_key_filter() const;
// An estimation of number of compaction for strategy to be satisfied.
int64_t estimated_pending_compactions(column_family& cf) const;

View File

@@ -39,6 +39,9 @@ public:
compatible_ring_position(const schema& s, dht::ring_position&& rp)
: _schema(&s), _rp(std::move(rp)) {
}
const dht::token& token() const {
return _rp->token();
}
friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return x._rp->tri_compare(*x._schema, *y._rp);
}

View File

@@ -414,8 +414,12 @@ public:
return _bytes.empty();
}
static bool is_static(bytes_view bytes, bool is_compound) {
return is_compound && bytes.size() > 2 && (bytes[0] & bytes[1] & 0xff) == 0xff;
}
bool is_static() const {
return size() > 2 && (_bytes.at(0) & _bytes.at(1) & 0xff) == 0xff;
return is_static(_bytes, _is_compound);
}
bool is_compound() const {
@@ -514,7 +518,7 @@ public:
}
bool is_static() const {
return size() > 2 && (_bytes.at(0) & _bytes.at(1)) == 0xff;
return composite::is_static(_bytes, _is_compound);
}
explicit operator bytes_view() const {

View File

@@ -39,17 +39,17 @@ public:
static constexpr auto CHUNK_LENGTH_KB = "chunk_length_kb";
static constexpr auto CRC_CHECK_CHANCE = "crc_check_chance";
private:
compressor _compressor = compressor::none;
compressor _compressor;
std::experimental::optional<int> _chunk_length;
std::experimental::optional<double> _crc_check_chance;
public:
compression_parameters() = default;
compression_parameters(compressor c) : _compressor(c) { }
compression_parameters(compressor c = compressor::lz4) : _compressor(c) { }
compression_parameters(const std::map<sstring, sstring>& options) {
validate_options(options);
auto it = options.find(SSTABLE_COMPRESSION);
if (it == options.end() || it->second.empty()) {
_compressor = compressor::none;
return;
}
const auto& compressor_class = it->second;

2
conf/housekeeping.cfg Normal file
View File

@@ -0,0 +1,2 @@
[housekeeping]
check-version: True

View File

@@ -211,16 +211,16 @@ batch_size_warn_threshold_in_kb: 5
# increase system_auth keyspace replication factor if you use this authorizer.
# authorizer: AllowAllAuthorizer
###################################################
## Not currently supported, reserved for future use
###################################################
# initial_token allows you to specify tokens manually. While you can use # it with
# vnodes (num_tokens > 1, above) -- in which case you should provide a
# comma-separated list -- it's primarily used when adding nodes # to legacy clusters
# that do not have vnodes enabled.
# initial_token:
###################################################
## Not currently supported, reserved for future use
###################################################
# See http://wiki.apache.org/cassandra/HintedHandoff
# May either be "true" or "false" to enable globally, or contain a list
# of data centers to enable per-datacenter.
@@ -409,29 +409,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# the smaller of 1/4 of heap or 512MB.
# file_cache_size_in_mb: 512
# Total permitted memory to use for memtables. Scylla will stop
# accepting writes when the limit is exceeded until a flush completes,
# and will trigger a flush based on memtable_cleanup_threshold
# If omitted, Scylla will set both to 1/4 the size of the heap.
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
# Ratio of occupied non-flushing memtable size to total permitted size
# that will trigger a flush of the largest memtable. Lager mct will
# mean larger flushes and hence less compaction, but also less concurrent
# flush activity which can make it difficult to keep your disks fed
# under heavy write load.
#
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.11
# Specify the way Scylla allocates and manages memtable memory.
# Options are:
# heap_buffers: on heap nio buffers
# offheap_buffers: off heap (direct) nio buffers
# offheap_objects: native memory, eliminating nio buffer heap overhead
# memtable_allocation_type: heap_buffers
# Total space to use for commitlogs.
#
# If space gets above this value (it will round up to the next nearest
@@ -443,17 +420,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# available for Scylla.
commitlog_total_space_in_mb: -1
# This sets the amount of memtable flush writer threads. These will
# be blocked by disk io, and each one will hold a memtable in memory
# while blocked.
#
# memtable_flush_writers defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
#memtable_flush_writers: 8
# A fixed memory pool size in MB for for SSTable index summaries. If left
# empty, this will default to 5% of the heap size. If the memory usage of
# all index summaries exceeds this limit, SSTables with low read rates will
@@ -784,7 +750,7 @@ commitlog_total_space_in_mb: -1
# can be: all - all traffic is compressed
# dc - traffic between different datacenters is compressed
# none - nothing is compressed.
# internode_compression: all
# internode_compression: none
# Enable or disable tcp_nodelay for inter-dc communication.
# Disabling it will result in larger (but fewer) network packets being sent,
@@ -813,3 +779,13 @@ commitlog_total_space_in_mb: -1
# freeing processor resources when there is other work to be done.
#
# defragment_memory_on_idle: true
#
# prometheus port
# By default, Scylla opens prometheus API port on port 9180
# setting the port to 0 will disable the prometheus API.
# prometheus_port: 9180
#
# prometheus address
# By default, Scylla binds all interfaces to the prometheus API
# It is possible to restrict the listening address to a specific one
# prometheus_address: 0.0.0.0

View File

@@ -108,6 +108,11 @@ def debug_flag(compiler):
print('Note: debug information disabled; upgrade your compiler')
return ''
def maybe_static(flag, libs):
if flag and not args.static:
libs = '-Wl,-Bstatic {} -Wl,-Bdynamic'.format(libs)
return libs
class Thrift(object):
def __init__(self, source, service):
self.source = source
@@ -184,7 +189,6 @@ scylla_tests = [
'tests/storage_proxy_test',
'tests/schema_change_test',
'tests/mutation_reader_test',
'tests/key_reader_test',
'tests/mutation_query_test',
'tests/row_cache_test',
'tests/test-serialization',
@@ -220,6 +224,10 @@ scylla_tests = [
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/database_test',
'tests/nonwrapping_range_test',
'tests/input_stream_test',
'tests/sstable_atomic_deletion_test',
'tests/virtual_reader_test',
]
apps = [
@@ -261,7 +269,9 @@ arg_parser.add_argument('--debuginfo', action = 'store', dest = 'debuginfo', typ
arg_parser.add_argument('--static-stdc++', dest = 'staticcxx', action = 'store_true',
help = 'Link libgcc and libstdc++ statically')
arg_parser.add_argument('--static-thrift', dest = 'staticthrift', action = 'store_true',
help = 'Link libthrift statically')
help = 'Link libthrift statically')
arg_parser.add_argument('--static-boost', dest = 'staticboost', action = 'store_true',
help = 'Link boost statically')
arg_parser.add_argument('--tests-debuginfo', action = 'store', dest = 'tests_debuginfo', type = int, default = 0,
help = 'Enable(1)/disable(0)compiler debug information generation for tests')
arg_parser.add_argument('--python', action = 'store', dest = 'python', default = 'python3',
@@ -297,9 +307,7 @@ scylla_core = (['database.cc',
'mutation_partition_serializer.cc',
'mutation_reader.cc',
'mutation_query.cc',
'key_reader.cc',
'keys.cc',
'clustering_key_filter.cc',
'sstables/sstables.cc',
'sstables/compress.cc',
'sstables/row.cc',
@@ -308,6 +316,7 @@ scylla_core = (['database.cc',
'sstables/compaction.cc',
'sstables/compaction_strategy.cc',
'sstables/compaction_manager.cc',
'sstables/atomic_deletion.cc',
'transport/event.cc',
'transport/event_notifier.cc',
'transport/server.cc',
@@ -327,6 +336,7 @@ scylla_core = (['database.cc',
'cql3/statements/authentication_statement.cc',
'cql3/statements/create_keyspace_statement.cc',
'cql3/statements/create_table_statement.cc',
'cql3/statements/create_view_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/create_user_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
@@ -425,6 +435,7 @@ scylla_core = (['database.cc',
'dht/i_partitioner.cc',
'dht/murmur3_partitioner.cc',
'dht/byte_ordered_partitioner.cc',
'dht/random_partitioner.cc',
'dht/boot_strapper.cc',
'dht/range_streamer.cc',
'unimplemented.cc',
@@ -483,7 +494,7 @@ scylla_core = (['database.cc',
'tracing/trace_state.cc',
'range_tombstone.cc',
'range_tombstone_list.cc',
'db/size_estimates_recorder.cc'
'disk-error-handler.cc'
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -562,46 +573,54 @@ deps = {
'scylla': idls + ['main.cc'] + scylla_core + api,
}
tests_not_using_seastar_test_framework = set([
'tests/keys_test',
pure_boost_tests = set([
'tests/partitioner_test',
'tests/map_difference_test',
'tests/keys_test',
'tests/compound_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/nonwrapping_range_test',
'tests/test-serialization',
'tests/range_test',
'tests/crc_test',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/cartesian_product_test',
])
tests_not_using_seastar_test_framework = set([
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
'tests/lsa_sync_eviction_test',
'tests/row_cache_alloc_stress',
'tests/perf_row_cache_update',
'tests/cartesian_product_test',
'tests/perf/perf_hash',
'tests/perf/perf_cql_parser',
'tests/message',
'tests/perf/perf_simple_query',
'tests/memory_footprint',
'tests/test-serialization',
'tests/gossip',
'tests/compound_test',
'tests/range_test',
'tests/crc_test',
'tests/perf/perf_sstable',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
])
]) | pure_boost_tests
for t in tests_not_using_seastar_test_framework:
if not t in scylla_tests:
raise Exception("Test %s not found in scylla_tests" % (t))
for t in scylla_tests:
deps[t] = scylla_tests_dependencies + [t + '.cc']
deps[t] = [t + '.cc']
if t not in tests_not_using_seastar_test_framework:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc']
deps['tests/bytes_ostream_test'] = ['tests/bytes_ostream_test.cc']
deps['tests/input_stream_test'] = ['tests/input_stream_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc']
deps['tests/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'tests/murmur_hash_test.cc']
deps['tests/allocation_strategy_test'] = ['tests/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
@@ -700,6 +719,8 @@ elif args.dpdk_target:
seastar_flags += ['--dpdk-target', args.dpdk_target]
if args.staticcxx:
seastar_flags += ['--static-stdc++']
if args.staticboost:
seastar_flags += ['--static-boost']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--cflags=%s' % (seastar_cflags)]
@@ -733,7 +754,14 @@ for mode in build_modes:
seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_restat.'
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = "-lyaml-cpp -llz4 -lz -lsnappy " + pkg_config("--libs", "jsoncpp") + ' -lboost_filesystem' + ' -lcrypt' + ' -lboost_date_time'
libs = ' '.join(['-lyaml-cpp', '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt',
maybe_static(args.staticboost, '-lboost_date_time'),
])
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
for pkg in pkgs:
args.user_cflags += ' ' + pkg_config('--cflags', pkg)
libs += ' ' + pkg_config('--libs', pkg)
@@ -847,6 +875,11 @@ with open(buildfile, 'w') as f:
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
else:
if binary.startswith('tests/'):
local_libs = '$libs'
if binary not in tests_not_using_seastar_test_framework or binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
# by many tests yields ridiculous amounts of disk space.
# So we strip the tests by default; The user can very
@@ -854,15 +887,15 @@ with open(buildfile, 'w') as f:
# to the test name, e.g., "ninja build/release/testname_g"
f.write('build $builddir/{}/{}: {}.{} {} {}\n'.format(mode, binary, tests_link_rule, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
f.write(' libs = {}\n'.format(local_libs))
else:
f.write('build $builddir/{}/{}: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
if has_thrift:
f.write(' libs = {} {} $libs\n'.format(thrift_libs, maybe_static(args.staticboost, '-lboost_system')))
for src in srcs:
if src.endswith('.cc'):
obj = '$builddir/' + mode + '/' + src.replace('.cc', '.o')

View File

@@ -35,7 +35,7 @@ class converting_mutation_partition_applier : public mutation_partition_visitor
deletable_row* _current_row;
private:
static bool is_compatible(const column_definition& new_def, const data_type& old_type, column_kind kind) {
return new_def.kind == kind && new_def.type->is_value_compatible_with(*old_type);
return ::is_compatible(new_def.kind, kind) && new_def.type->is_value_compatible_with(*old_type);
}
void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (is_compatible(new_def, old_type, kind) && cell.timestamp() > new_def.dropped_at()) {

View File

@@ -40,6 +40,7 @@ options {
#include "cql3/statements/drop_keyspace_statement.hh"
#include "cql3/statements/create_index_statement.hh"
#include "cql3/statements/create_table_statement.hh"
#include "cql3/statements/create_view_statement.hh"
#include "cql3/statements/create_type_statement.hh"
#include "cql3/statements/drop_type_statement.hh"
#include "cql3/statements/alter_type_statement.hh"
@@ -340,6 +341,7 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st30=createAggregateStatement { $stmt = st30; }
| st31=dropAggregateStatement { $stmt = st31; }
#endif
| st32=createViewStatement { $stmt = st32; }
;
/*
@@ -716,7 +718,7 @@ createTableStatement returns [shared_ptr<cql3::statements::create_table_statemen
cfamDefinition[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: '(' cfamColumns[expr] ( ',' cfamColumns[expr]? )* ')'
( K_WITH cfamProperty[expr] ( K_AND cfamProperty[expr] )*)?
( K_WITH cfamProperty[$expr->properties()] ( K_AND cfamProperty[$expr->properties()] )*)?
;
cfamColumns[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
@@ -732,15 +734,15 @@ pkDef[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
| '(' k1=ident { l.push_back(k1); } ( ',' kn=ident { l.push_back(kn); } )* ')' { $expr->add_key_aliases(l); }
;
cfamProperty[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: property[expr->properties]
| K_COMPACT K_STORAGE { $expr->set_compact_storage(); }
cfamProperty[cql3::statements::cf_properties& expr]
: property[$expr.properties()]
| K_COMPACT K_STORAGE { $expr.set_compact_storage(); }
| K_CLUSTERING K_ORDER K_BY '(' cfamOrdering[expr] (',' cfamOrdering[expr])* ')'
;
cfamOrdering[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
cfamOrdering[cql3::statements::cf_properties& expr]
@init{ bool reversed=false; }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr->set_ordering(k, reversed); }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr.set_ordering(k, reversed); }
;
@@ -787,6 +789,39 @@ indexIdent returns [::shared_ptr<index_target::raw> id]
| K_FULL '(' c=cident ')' { $id = index_target::raw::full_collection(c); }
;
/**
* CREATE MATERIALIZED VIEW <viewName> AS
* SELECT <columns>
* FROM <CF>
* WHERE <pkColumns> IS NOT NULL
* PRIMARY KEY (<pkColumns>)
* WITH <property> = <value> AND ...;
*/
createViewStatement returns [::shared_ptr<create_view_statement> expr]
@init {
bool if_not_exists = false;
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> composite_keys;
}
: K_CREATE K_MATERIALIZED K_VIEW (K_IF K_NOT K_EXISTS { if_not_exists = true; })? cf=columnFamilyName K_AS
K_SELECT sclause=selectClause K_FROM basecf=columnFamilyName
(K_WHERE wclause=whereClause)?
K_PRIMARY K_KEY (
'(' '(' k1=cident { partition_keys.push_back(k1); } ( ',' kn=cident { partition_keys.push_back(kn); } )* ')' ( ',' c1=cident { composite_keys.push_back(c1); } )* ')'
| '(' k1=cident { partition_keys.push_back(k1); } ( ',' cn=cident { composite_keys.push_back(cn); } )* ')'
)
{
$expr = ::make_shared<create_view_statement>(
std::move(cf),
std::move(basecf),
std::move(sclause),
std::move(wclause),
std::move(partition_keys),
std::move(composite_keys),
if_not_exists);
}
( K_WITH cfamProperty[{ $expr->properties() }] ( K_AND cfamProperty[{ $expr->properties() }] )*)?
;
#if 0
/**
@@ -1304,7 +1339,8 @@ relation[std::vector<cql3::relation_ptr>& clauses]
| K_TOKEN l=tupleOfIdentifiers type=relationType t=term
{ $clauses.emplace_back(::make_shared<cql3::token_relation>(std::move(l), *type, std::move(t))); }
| name=cident K_IS K_NOT K_NULL {
$clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IS_NOT, cql3::constants::NULL_LITERAL)); }
| name=cident K_IN marker=inMarker
{ $clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IN, std::move(marker))); }
| name=cident K_IN in_values=singleColumnInValues
@@ -1528,6 +1564,8 @@ K_KEYSPACE: ( K E Y S P A C E
K_KEYSPACES: K E Y S P A C E S;
K_COLUMNFAMILY:( C O L U M N F A M I L Y
| T A B L E );
K_MATERIALIZED:M A T E R I A L I Z E D;
K_VIEW: V I E W;
K_INDEX: I N D E X;
K_CUSTOM: C U S T O M;
K_ON: O N;
@@ -1551,6 +1589,7 @@ K_DESC: D E S C;
K_ALLOW: A L L O W;
K_FILTERING: F I L T E R I N G;
K_IF: I F;
K_IS: I S;
K_CONTAINS: C O N T A I N S;
K_GRANT: G R A N T;

View File

@@ -52,5 +52,6 @@ const operator_type operator_type::IN(7, operator_type::IN, "IN");
const operator_type operator_type::CONTAINS(5, operator_type::CONTAINS, "CONTAINS");
const operator_type operator_type::CONTAINS_KEY(6, operator_type::CONTAINS_KEY, "CONTAINS_KEY");
const operator_type operator_type::NEQ(8, operator_type::NEQ, "!=");
const operator_type operator_type::IS_NOT(9, operator_type::IS_NOT, "IS NOT");
}

View File

@@ -58,6 +58,7 @@ public:
static const operator_type CONTAINS;
static const operator_type CONTAINS_KEY;
static const operator_type NEQ;
static const operator_type IS_NOT;
private:
int32_t _b;
const operator_type& _reverse;

View File

@@ -93,12 +93,6 @@ public:
specific_options options,
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
std::vector<std::vector<bytes_view_opt>> value_views,
bool skip_metadata,
specific_options options,
cql_serialization_format sf);
// Batch query_options constructor
explicit query_options(query_options&&, std::vector<std::vector<bytes_view_opt>> value_views);

View File

@@ -64,8 +64,8 @@ class query_processor::internal_state {
service::query_state _qs;
public:
internal_state()
: _qs(service::client_state{service::client_state::internal_tag()}) {
}
: _qs(service::client_state{service::client_state::internal_tag()})
{ }
operator service::query_state&() {
return _qs;
}
@@ -88,17 +88,37 @@ api::timestamp_type query_processor::next_timestamp() {
}
query_processor::query_processor(distributed<service::storage_proxy>& proxy,
distributed<database>& db)
distributed<database>& db)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _collectd_regs{
scollectd::add_polled_metric(scollectd::type_instance_id("query_processor"
, scollectd::per_cpu_plugin_instance
, "total_operations", "statements_prepared")
, scollectd::make_typed(scollectd::data_type::DERIVE, _stats.prepare_invocations)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "reads")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.reads)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "inserts")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.inserts)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "updates")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.updates)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "deletes")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.deletes)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "batches")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.batches))}
, _internal_state(new internal_state())
{
_collectd_regs.push_back(
scollectd::add_polled_metric(scollectd::type_instance_id("query_processor"
, scollectd::per_cpu_plugin_instance
, "total_operations", "statements_prepared")
, scollectd::make_typed(scollectd::data_type::DERIVE, _stats.prepare_invocations)));
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
}
@@ -133,8 +153,9 @@ query_processor::process(const sstring_view& query_string, service::query_state&
}
future<::shared_ptr<result_message>>
query_processor::process_statement(::shared_ptr<cql_statement> statement, service::query_state& query_state,
const query_options& options)
query_processor::process_statement(::shared_ptr<cql_statement> statement,
service::query_state& query_state,
const query_options& options)
{
#if 0
logger.trace("Process {} @CL.{}", statement, options.getConsistency());
@@ -145,7 +166,7 @@ query_processor::process_statement(::shared_ptr<cql_statement> statement, servic
statement->validate(_proxy, client_state);
future<::shared_ptr<transport::messages::result_message>> fut = make_ready_future<::shared_ptr<transport::messages::result_message>>();
auto fut = make_ready_future<::shared_ptr<transport::messages::result_message>>();
if (client_state.is_internal()) {
fut = statement->execute_internal(_proxy, query_state, options);
} else {
@@ -170,7 +191,9 @@ query_processor::prepare(const std::experimental::string_view& query_string, ser
}
future<::shared_ptr<transport::messages::result_message::prepared>>
query_processor::prepare(const std::experimental::string_view& query_string, const service::client_state& client_state, bool for_thrift)
query_processor::prepare(const std::experimental::string_view& query_string,
const service::client_state& client_state,
bool for_thrift)
{
auto existing = get_stored_prepared_statement(query_string, client_state.get_raw_keyspace(), for_thrift);
if (existing) {
@@ -186,7 +209,9 @@ query_processor::prepare(const std::experimental::string_view& query_string, con
}
::shared_ptr<transport::messages::result_message::prepared>
query_processor::get_stored_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace, bool for_thrift)
query_processor::get_stored_prepared_statement(const std::experimental::string_view& query_string,
const sstring& keyspace,
bool for_thrift)
{
if (for_thrift) {
auto statement_id = compute_thrift_id(query_string, keyspace);
@@ -206,8 +231,10 @@ query_processor::get_stored_prepared_statement(const std::experimental::string_v
}
future<::shared_ptr<transport::messages::result_message::prepared>>
query_processor::store_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace,
::shared_ptr<statements::prepared_statement> prepared, bool for_thrift)
query_processor::store_prepared_statement(const std::experimental::string_view& query_string,
const sstring& keyspace,
::shared_ptr<statements::prepared_statement> prepared,
bool for_thrift)
{
#if 0
// Concatenate the current keyspace so we don't mix prepared statements between keyspace (#5352).
@@ -278,7 +305,7 @@ query_processor::get_statement(const sstring_view& query, const service::client_
Tracing.trace("Preparing statement");
#endif
++_stats.prepare_invocations;
return statement->prepare(_db.local());
return statement->prepare(_db.local(), _cql_stats);
}
::shared_ptr<raw::parsed_statement>
@@ -301,7 +328,7 @@ query_processor::parse_statement(const sstring_view& query)
if (!statement) {
throw exceptions::syntax_exception("Parsing failed");
};
}
return std::move(statement);
} catch (const exceptions::recognition_exception& e) {
throw exceptions::syntax_exception(sprint("Invalid or malformed CQL query string: %s", e.what()));
@@ -313,10 +340,10 @@ query_processor::parse_statement(const sstring_view& query)
}
}
query_options query_processor::make_internal_options(
::shared_ptr<statements::prepared_statement> p,
const std::initializer_list<data_value>& values,
db::consistency_level cl) {
query_options query_processor::make_internal_options(::shared_ptr<statements::prepared_statement> p,
const std::initializer_list<data_value>& values,
db::consistency_level cl)
{
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(sprint("Invalid number of values. Expecting %d but got %d", p->bound_names.size(), values.size()));
}
@@ -335,20 +362,21 @@ query_options query_processor::make_internal_options(
return query_options(cl, bound_values);
}
::shared_ptr<statements::prepared_statement> query_processor::prepare_internal(
const sstring& query_string) {
::shared_ptr<statements::prepared_statement> query_processor::prepare_internal(const sstring& query_string)
{
auto& p = _internal_statements[query_string];
if (p == nullptr) {
auto np = parse_statement(query_string)->prepare(_db.local());
auto np = parse_statement(query_string)->prepare(_db.local(), _cql_stats);
np->statement->validate(_proxy, *_internal_state);
p = std::move(np); // inserts it into map
}
return p;
}
future<::shared_ptr<untyped_result_set>> query_processor::execute_internal(
const sstring& query_string,
const std::initializer_list<data_value>& values) {
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(const sstring& query_string,
const std::initializer_list<data_value>& values)
{
if (log.is_enabled(logging::log_level::trace)) {
log.trace("execute_internal: \"{}\" ({})", query_string, ::join(", ", values));
}
@@ -356,47 +384,49 @@ future<::shared_ptr<untyped_result_set>> query_processor::execute_internal(
return execute_internal(p, values);
}
future<::shared_ptr<untyped_result_set>> query_processor::execute_internal(
::shared_ptr<statements::prepared_statement> p,
const std::initializer_list<data_value>& values) {
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(::shared_ptr<statements::prepared_statement> p,
const std::initializer_list<data_value>& values)
{
auto opts = make_internal_options(p, values);
return do_with(std::move(opts),
[this, p = std::move(p)](query_options & opts) {
return p->statement->execute_internal(_proxy, *_internal_state, opts).then(
[p](::shared_ptr<transport::messages::result_message> msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
return do_with(std::move(opts), [this, p = std::move(p)](auto& opts) {
return p->statement->execute_internal(_proxy, *_internal_state, opts).then([p](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
}
future<::shared_ptr<untyped_result_set>> query_processor::process(
const sstring& query_string,
db::consistency_level cl, const std::initializer_list<data_value>& values, bool cache)
future<::shared_ptr<untyped_result_set>>
query_processor::process(const sstring& query_string,
db::consistency_level cl,
const std::initializer_list<data_value>& values,
bool cache)
{
auto p = cache ? prepare_internal(query_string) : parse_statement(query_string)->prepare(_db.local());
auto p = cache ? prepare_internal(query_string) : parse_statement(query_string)->prepare(_db.local(), _cql_stats);
if (!cache) {
p->statement->validate(_proxy, *_internal_state);
}
return process(p, cl, values);
}
future<::shared_ptr<untyped_result_set>> query_processor::process(
::shared_ptr<statements::prepared_statement> p,
db::consistency_level cl, const std::initializer_list<data_value>& values)
future<::shared_ptr<untyped_result_set>>
query_processor::process(::shared_ptr<statements::prepared_statement> p,
db::consistency_level cl,
const std::initializer_list<data_value>& values)
{
auto opts = make_internal_options(p, values, cl);
return do_with(std::move(opts),
[this, p = std::move(p)](query_options & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then(
[p](::shared_ptr<transport::messages::result_message> msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
return do_with(std::move(opts), [this, p = std::move(p)](auto & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then([p](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
}
future<::shared_ptr<transport::messages::result_message>>
query_processor::process_batch(::shared_ptr<statements::batch_statement> batch, service::query_state& query_state, query_options& options) {
query_processor::process_batch(::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options)
{
return batch->check_access(query_state.get_client_state()).then([this, &query_state, &options, batch] {
batch->validate();
batch->validate(_proxy, query_state.get_client_state());

View File

@@ -75,7 +75,9 @@ private:
uint64_t prepare_invocations = 0;
} _stats;
std::vector<scollectd::registration> _collectd_regs;
cql_stats _cql_stats;
scollectd::registrations _collectd_regs;
class internal_state;
std::unique_ptr<internal_state> _internal_state;
@@ -92,6 +94,11 @@ public:
distributed<service::storage_proxy>& proxy() {
return _proxy;
}
cql_stats& get_cql_stats() {
return _cql_stats;
}
#if 0
public static final QueryProcessor instance = new QueryProcessor();
#endif

View File

@@ -156,6 +156,10 @@ public:
return new_contains_restriction(db, schema, bound_names, false);
} else if (_relation_type == operator_type::CONTAINS_KEY) {
return new_contains_restriction(db, schema, bound_names, true);
} else if (_relation_type == operator_type::IS_NOT) {
// This case is not supposed to happen: statement_restrictions
// constructor does not call this function for views' IS_NOT.
throw exceptions::invalid_request_exception(sprint("Unsupported \"IS NOT\" relation: %s", to_string()));
} else {
throw exceptions::invalid_request_exception(sprint("Unsupported \"!=\" relation: %s", to_string()));
}

View File

@@ -393,8 +393,12 @@ public:
auto prefix = clustering_key_prefix::from_optional_exploded(*_schema, vals);
return bounds_range_type::bound(prefix, is_inclusive(b));
};
auto range = bounds_range_type(read_bound(statements::bound::START), read_bound(statements::bound::END));
return { range };
auto range = wrapping_range<clustering_key_prefix>(read_bound(statements::bound::START), read_bound(statements::bound::END));
auto bounds = bound_view::from_range(range);
if (bound_view::compare(*_schema)(bounds.second, bounds.first)) {
return { };
}
return { bounds_range_type(std::move(range)) };
}
#if 0
@Override

View File

@@ -46,6 +46,8 @@
#include "cartesian_product.hh"
#include "cql3/restrictions/primary_key_restrictions.hh"
#include "cql3/restrictions/single_column_restrictions.hh"
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/filtered.hpp>
namespace cql3 {
@@ -352,7 +354,14 @@ single_column_primary_key_restrictions<partition_key>::bounds_ranges(const query
template<>
std::vector<query::clustering_range>
single_column_primary_key_restrictions<clustering_key_prefix>::bounds_ranges(const query_options& options) const {
auto bounds = compute_bounds(options);
auto wrapping_bounds = compute_bounds(options);
auto bounds = boost::copy_range<query::clustering_row_ranges>(wrapping_bounds
| boost::adaptors::filtered([&](auto&& r) {
auto bounds = bound_view::from_range(r);
return !bound_view::compare(*_schema)(bounds.second, bounds.first);
})
| boost::adaptors::transformed([&](auto&& r) { return query::clustering_range(std::move(r));
}));
auto less_cmp = clustering_key_prefix::less_compare(*_schema);
std::sort(bounds.begin(), bounds.end(), [&] (query::clustering_range& x, query::clustering_range& y) {
if (!x.start() && !y.start()) {

View File

@@ -28,6 +28,9 @@
#include "single_column_primary_key_restrictions.hh"
#include "token_restriction.hh"
#include "cql3/single_column_relation.hh"
#include "cql3/constants.hh"
namespace cql3 {
namespace restrictions {
@@ -131,13 +134,21 @@ statement_restrictions::statement_restrictions(schema_ptr schema)
, _clustering_columns_restrictions(get_initial_key_restrictions<clustering_key_prefix>())
, _nonprimary_key_restrictions(::make_shared<single_column_restrictions>(schema))
{ }
#if 0
static const column_definition*
to_column_definition(const schema_ptr& schema, const ::shared_ptr<column_identifier::raw>& entity) {
return get_column_definition(schema,
*entity->prepare_column_identifier(schema));
}
#endif
statement_restrictions::statement_restrictions(database& db,
schema_ptr schema,
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection)
bool select_a_collection,
bool for_view)
: statement_restrictions(schema)
{
/*
@@ -149,7 +160,31 @@ statement_restrictions::statement_restrictions(database& db,
*/
if (!where_clause.empty()) {
for (auto&& relation : where_clause) {
add_restriction(relation->to_restriction(db, schema, bound_names));
if (relation->get_operator() == cql3::operator_type::IS_NOT) {
single_column_relation* r =
dynamic_cast<single_column_relation*>(relation.get());
// The "IS NOT NULL" restriction is only supported (and
// mandatory) for materialized view creation:
if (!r) {
throw exceptions::invalid_request_exception("IS NOT only supports single column");
}
// currently, the grammar only allows the NULL argument to be
// "IS NOT", so this assertion should not be able to fail
assert(r->get_value() == cql3::constants::NULL_LITERAL);
auto col_id = r->get_entity()->prepare_column_identifier(schema);
const auto *cd = get_column_definition(schema, *col_id);
if (!cd) {
throw exceptions::invalid_request_exception(sprint("restriction '%s' unknown column %s", relation->to_string(), r->get_entity()->to_string()));
}
_not_null_columns.insert(cd);
if (!for_view) {
throw exceptions::invalid_request_exception(sprint("restriction '%s' is only supported in materialized view creation", relation->to_string()));
}
} else {
add_restriction(relation->to_restriction(db, schema, bound_names));
}
}
}

View File

@@ -83,6 +83,8 @@ private:
*/
::shared_ptr<single_column_restrictions> _nonprimary_key_restrictions;
std::unordered_set<const column_definition*> _not_null_columns;
/**
* The restrictions used to build the index expressions
*/
@@ -112,7 +114,8 @@ public:
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection);
bool select_a_collection,
bool for_view = false);
private:
void add_restriction(::shared_ptr<restriction> restriction);
void add_single_column_restriction(::shared_ptr<single_column_restriction> restriction);

View File

@@ -97,7 +97,14 @@ public:
if (!buf) {
throw exceptions::invalid_request_exception("Invalid null token value");
}
return dht::token(dht::token::kind::key, *buf);
auto tk = dht::global_partitioner().from_bytes(*buf);
if (tk.is_minimum() && !is_start(b)) {
// The token was parsed as a minimum marker (token::kind::before_all_keys), but
// as it appears in the end bound position, it is actually the maximum marker
// (token::kind::after_all_keys).
return dht::maximum_token();
}
return tk;
};
const auto start_token = get_token_bound(statements::bound::START);

View File

@@ -207,7 +207,7 @@ protected:
auto& columns = schema->all_columns_in_select_order();
cds.reserve(columns.size());
for (auto& c : columns) {
if (!c.is_compact_value() || !c.name().empty()) {
if (!schema->is_dense() || !c.is_regular() || !c.name().empty()) {
cds.emplace_back(&c);
}
}
@@ -232,7 +232,7 @@ uint32_t selection::add_column_for_ordering(const column_definition& c) {
raw_selector::to_selectables(raw_selectors, schema), db, schema, defs);
auto metadata = collect_metadata(schema, raw_selectors, *factories);
if (processes_selection(raw_selectors)) {
if (processes_selection(raw_selectors) || raw_selectors.size() != defs.size()) {
return ::make_shared<selection_with_processing>(schema, std::move(defs), std::move(metadata), std::move(factories));
} else {
return ::make_shared<simple_selection>(schema, std::move(defs), std::move(metadata), false);
@@ -393,9 +393,6 @@ void result_set_builder::visitor::accept_new_row(
case column_kind::regular_column:
add_value(*def, row_iterator);
break;
case column_kind::compact_column:
add_value(*def, row_iterator);
break;
case column_kind::static_column:
add_value(*def, static_row_iterator);
break;

View File

@@ -95,7 +95,7 @@ single_column_relation::to_receivers(schema_ptr schema, const column_definition&
using namespace statements::request_validations;
auto receiver = column_def.column_specification;
if (column_def.is_compact_value()) {
if (schema->is_dense() && column_def.is_regular()) {
throw exceptions::invalid_request_exception(sprint(
"Predicates on the non-primary-key column (%s) of a COMPACT table are not yet supported", column_def.name_as_text()));
}

View File

@@ -110,6 +110,11 @@ public:
::shared_ptr<term::raw> get_map_key() {
return _map_key;
}
::shared_ptr<term::raw> get_value() {
return _value;
}
protected:
virtual ::shared_ptr<term> to_term(const std::vector<::shared_ptr<column_specification>>& receivers,
::shared_ptr<term::raw> raw, database& db, const sstring& keyspace,

View File

@@ -40,6 +40,7 @@
*/
#include "alter_keyspace_statement.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "db/system_keyspace.hh"
#include "database.hh"
@@ -100,3 +101,9 @@ shared_ptr<transport::event::schema_change> cql3::statements::alter_keyspace_sta
transport::event::schema_change::change_type::UPDATED,
keyspace());
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_keyspace_statement>(*this));
}

View File

@@ -63,6 +63,7 @@ public:
void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -40,6 +40,7 @@
*/
#include "cql3/statements/alter_table_statement.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "validation.hh"
#include "db/config.hh"
@@ -180,7 +181,6 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
}
break;
case column_kind::compact_column:
case column_kind::regular_column:
case column_kind::static_column:
// Thrift allows to change a column validator so CFMetaData.validateCompatibility will let it slide
@@ -273,6 +273,11 @@ shared_ptr<transport::event::schema_change> alter_table_statement::change_event(
transport::event::schema_change::target_type::TABLE, keyspace(), column_family());
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_table_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_table_statement>(*this));
}
}
}

View File

@@ -80,6 +80,7 @@ public:
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -39,9 +39,11 @@
#include "cql3/statements/alter_type_statement.hh"
#include "cql3/statements/create_type_statement.hh"
#include "prepared_statement.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "boost/range/adaptor/map.hpp"
#include "stdx.hh"
namespace cql3 {
@@ -85,14 +87,14 @@ const sstring& alter_type_statement::keyspace() const
return _name.get_keyspace();
}
static int32_t get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
static stdx::optional<uint32_t> get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
{
for (uint32_t i = 0; i < type->field_names().size(); ++i) {
if (field->name() == type->field_names()[i]) {
return i;
return {i};
}
}
return -1;
return {};
}
void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, bool is_local_only)
@@ -163,7 +165,7 @@ alter_type_statement::add_or_alter::add_or_alter(const ut_name& name, bool is_ad
user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_update) const
{
if (get_idx_of_field(to_update, _field_name) >= 0) {
if (get_idx_of_field(to_update, _field_name)) {
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s to type %s: a field of the same name already exists", _field_name->name(), _name.to_string()));
}
@@ -180,19 +182,19 @@ user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_
user_type alter_type_statement::add_or_alter::do_alter(database& db, user_type to_update) const
{
uint32_t idx = get_idx_of_field(to_update, _field_name);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, _field_name);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", _field_name->name(), _name.to_string()));
}
auto previous = to_update->field_types()[idx];
auto previous = to_update->field_types()[*idx];
auto new_type = _field_type->prepare(db, keyspace())->get_type();
if (!new_type->is_compatible_with(*previous)) {
throw exceptions::invalid_request_exception(sprint("Type %s in incompatible with previous type %s of field %s in user type %s", _field_type->to_string(), previous->as_cql3_type()->to_string(), _field_name->name(), _name.to_string()));
}
std::vector<data_type> new_types(to_update->field_types());
new_types[idx] = new_type;
new_types[*idx] = new_type;
return user_type_impl::get_instance(to_update->_keyspace, to_update->_name, to_update->field_names(), std::move(new_types));
}
@@ -216,17 +218,27 @@ user_type alter_type_statement::renames::make_updated_type(database& db, user_ty
std::vector<bytes> new_names(to_update->field_names());
for (auto&& rename : _renames) {
auto&& from = rename.first;
int32_t idx = get_idx_of_field(to_update, from);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, from);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", from->to_string(), _name.to_string()));
}
new_names[idx] = rename.second->name();
new_names[*idx] = rename.second->name();
}
auto&& updated = user_type_impl::get_instance(to_update->_keyspace, to_update->_name, std::move(new_names), to_update->field_types());
create_type_statement::check_for_duplicate_names(updated);
return updated;
}
shared_ptr<cql3::statements::prepared_statement>
alter_type_statement::add_or_alter::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::add_or_alter>(*this));
}
shared_ptr<cql3::statements::prepared_statement>
alter_type_statement::renames::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::renames>(*this));
}
}
}

View File

@@ -84,6 +84,7 @@ public:
const shared_ptr<column_identifier> field_name,
const shared_ptr<cql3_type::raw> field_type);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
private:
user_type do_add(database& db, user_type to_update) const;
user_type do_alter(database& db, user_type to_update) const;
@@ -100,6 +101,7 @@ public:
void add_rename(shared_ptr<column_identifier> previous_name, shared_ptr<column_identifier> new_name);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -47,7 +47,7 @@ uint32_t cql3::statements::authentication_statement::get_bound_terms() {
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authentication_statement::prepare(
database& db) {
database& db, cql_stats& stats) {
return ::make_shared<prepared>(this->shared_from_this());
}

View File

@@ -54,7 +54,7 @@ class authentication_statement : public raw::parsed_statement, public cql_statem
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;

View File

@@ -47,7 +47,7 @@ uint32_t cql3::statements::authorization_statement::get_bound_terms() {
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authorization_statement::prepare(
database& db) {
database& db, cql_stats& stats) {
return ::make_shared<parsed_statement::prepared>(this->shared_from_this());
}

View File

@@ -58,7 +58,7 @@ class authorization_statement : public raw::parsed_statement, public cql_stateme
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;

View File

@@ -58,7 +58,7 @@ bool batch_statement::depends_on_column_family(const sstring& cf_name) const
}
void batch_statement::verify_batch_size(const std::vector<mutation>& mutations) {
size_t warn_threshold = service::get_local_storage_proxy().get_db().local().get_config().batch_size_warn_threshold_in_kb();
size_t warn_threshold = service::get_local_storage_proxy().get_db().local().get_config().batch_size_warn_threshold_in_kb() * 1024;
class my_partition_visitor : public mutation_partition_visitor {
public:
@@ -87,35 +87,33 @@ void batch_statement::verify_batch_size(const std::vector<mutation>& mutations)
m.partition().accept(*m.schema(), v);
}
auto size = v.size / 1024;
if (size > warn_threshold) {
if (v.size > warn_threshold) {
std::unordered_set<sstring> ks_cf_pairs;
for (auto&& m : mutations) {
ks_cf_pairs.insert(m.schema()->ks_name() + "." + m.schema()->cf_name());
}
_logger.warn(
"Batch of prepared statements for {} is of size {}, exceeding specified threshold of {} by {}.{}",
join(", ", ks_cf_pairs), size, warn_threshold,
size - warn_threshold, "");
join(", ", ks_cf_pairs), v.size, warn_threshold,
v.size - warn_threshold, "");
}
}
namespace raw {
shared_ptr<prepared_statement>
batch_statement::prepare(database& db) {
batch_statement::prepare(database& db, cql_stats& stats) {
auto&& bound_names = get_bound_variables();
std::vector<shared_ptr<cql3::statements::modification_statement>> statements;
for (auto&& parsed : _parsed_statements) {
statements.push_back(parsed->prepare(db, bound_names));
statements.push_back(parsed->prepare(db, bound_names, stats));
}
auto&& prep_attrs = _attrs->prepare(db, "[batch]", "[batch]");
prep_attrs->collect_marker_specification(bound_names);
cql3::statements::batch_statement batch_statement_(bound_names->size(), _type, std::move(statements), std::move(prep_attrs));
cql3::statements::batch_statement batch_statement_(bound_names->size(), _type, std::move(statements), std::move(prep_attrs), stats);
batch_statement_.validate();
return ::make_shared<prepared>(make_shared(std::move(batch_statement_)),

View File

@@ -74,6 +74,7 @@ private:
std::vector<shared_ptr<modification_statement>> _statements;
std::unique_ptr<attributes> _attrs;
bool _has_conditions;
cql_stats& _stats;
public:
/**
* Creates a new BatchStatement from a list of statements and a
@@ -85,10 +86,12 @@ public:
*/
batch_statement(int bound_terms, type type_,
std::vector<shared_ptr<modification_statement>> statements,
std::unique_ptr<attributes> attrs)
std::unique_ptr<attributes> attrs,
cql_stats& stats)
: _bound_terms(bound_terms), _type(type_), _statements(std::move(statements))
, _attrs(std::move(attrs))
, _has_conditions(boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::has_conditions))) {
, _has_conditions(boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::has_conditions)))
, _stats(stats) {
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
@@ -175,6 +178,7 @@ private:
boost::make_counting_iterator<size_t>(_statements.size()),
[this, &storage, &options, now, local, &result, trace_state] (size_t i) {
auto&& statement = _statements[i];
statement->inc_cql_stats();
auto&& statement_options = options.for_statement(i);
auto timestamp = _attrs->get_timestamp(now, statement_options);
return statement->get_mutations(storage, statement_options, local, timestamp, trace_state).then([&result] (auto&& more) {
@@ -195,6 +199,7 @@ public:
virtual future<shared_ptr<transport::messages::result_message>> execute(
distributed<service::storage_proxy>& storage, service::query_state& state, const query_options& options) override {
++_stats.batches;
return execute(storage, state, options, false, options.get_timestamp(state));
}
private:

View File

@@ -92,20 +92,20 @@ void cf_prop_defs::validate() {
throw exceptions::configuration_exception(sstring("Missing sub-option '") + COMPACTION_STRATEGY_CLASS_KEY + "' for the '" + KW_COMPACTION + "' option.");
}
_compaction_strategy_class = sstables::compaction_strategy::type(strategy->second);
#if 0
compactionOptions.remove(COMPACTION_STRATEGY_CLASS_KEY);
remove_from_map_if_exists(KW_COMPACTION, COMPACTION_STRATEGY_CLASS_KEY);
#if 0
CFMetaData.validateCompactionOptions(compactionStrategyClass, compactionOptions);
#endif
}
auto compression_options = get_compression_options();
if (!compression_options.empty()) {
auto sstable_compression_class = compression_options.find(sstring(compression_parameters::SSTABLE_COMPRESSION));
if (sstable_compression_class == compression_options.end()) {
if (compression_options && !compression_options->empty()) {
auto sstable_compression_class = compression_options->find(sstring(compression_parameters::SSTABLE_COMPRESSION));
if (sstable_compression_class == compression_options->end()) {
throw exceptions::configuration_exception(sstring("Missing sub-option '") + compression_parameters::SSTABLE_COMPRESSION + "' for the '" + KW_COMPRESSION + "' option.");
}
compression_parameters cp(compression_options);
compression_parameters cp(*compression_options);
cp.validate();
}
@@ -131,12 +131,12 @@ std::map<sstring, sstring> cf_prop_defs::get_compaction_options() const {
return std::map<sstring, sstring>{};
}
std::map<sstring, sstring> cf_prop_defs::get_compression_options() const {
stdx::optional<std::map<sstring, sstring>> cf_prop_defs::get_compression_options() const {
auto compression_options = get_map(KW_COMPRESSION);
if (compression_options) {
return compression_options.value();
return { compression_options.value() };
}
return std::map<sstring, sstring>{};
return { };
}
int32_t cf_prop_defs::get_default_time_to_live() const
@@ -206,8 +206,9 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder) {
}
builder.set_bloom_filter_fp_chance(get_double(KW_BF_FP_CHANCE, builder.get_bloom_filter_fp_chance()));
if (!get_compression_options().empty()) {
builder.set_compressor_params(compression_parameters(get_compression_options()));
auto compression_options = get_compression_options();
if (compression_options) {
builder.set_compressor_params(compression_parameters(*compression_options));
}
#if 0
CachingOptions cachingOptions = getCachingOptions();

View File

@@ -82,7 +82,7 @@ private:
public:
void validate();
std::map<sstring, sstring> get_compaction_options() const;
std::map<sstring, sstring> get_compression_options() const;
stdx::optional<std::map<sstring, sstring>> get_compression_options() const;
#if 0
public CachingOptions getCachingOptions() throws SyntaxException, ConfigurationException
{

View File

@@ -0,0 +1,97 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/statements/cf_prop_defs.hh"
namespace cql3 {
namespace statements {
/**
* Class for common statement properties.
*/
class cf_properties final {
const ::shared_ptr<cf_prop_defs> _properties = ::make_shared<cf_prop_defs>();
bool _use_compact_storage = false;
std::vector<std::pair<::shared_ptr<column_identifier>, bool>> _defined_ordering; // Insertion ordering is important
public:
auto& properties() const {
return _properties;
}
bool use_compact_storage() const {
return _use_compact_storage;
}
void set_compact_storage() {
_use_compact_storage = true;
}
auto& defined_ordering() const {
return _defined_ordering;
}
data_type get_reversable_type(::shared_ptr<column_identifier> t, data_type type) const {
auto is_reversed = find_ordering_info(t);
return is_reversed && *is_reversed ? reversed_type_impl::get_instance(type) : type;
}
std::experimental::optional<bool> find_ordering_info(::shared_ptr<column_identifier> type) const {
for (auto& t: _defined_ordering) {
if (*(t.first) == *type) {
return t.second;
}
}
return {};
}
void set_ordering(::shared_ptr<column_identifier> alias, bool reversed) {
_defined_ordering.emplace_back(alias, reversed);
}
void validate() {
_properties->validate();
}
};
}
}

View File

@@ -40,6 +40,7 @@
*/
#include "create_index_statement.hh"
#include "prepared_statement.hh"
#include "validation.hh"
#include "service/storage_proxy.hh"
#include "service/migration_manager.hh"
@@ -205,4 +206,9 @@ cql3::statements::create_index_statement::announce_migration(distributed<service
});
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::create_index_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_index_statement>(*this));
}

View File

@@ -87,6 +87,7 @@ public:
transport::event::schema_change::target_type::TABLE, keyspace(),
column_family());
}
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -40,6 +40,7 @@
*/
#include "cql3/statements/create_keyspace_statement.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
@@ -122,6 +123,11 @@ shared_ptr<transport::event::schema_change> create_keyspace_statement::change_ev
return make_shared<transport::event::schema_change>(transport::event::schema_change::change_type::CREATED, keyspace());
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::create_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_keyspace_statement>(*this));
}
}
}

View File

@@ -84,6 +84,7 @@ public:
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -64,12 +64,6 @@ create_table_statement::create_table_statement(::shared_ptr<cf_name> name,
, _properties{properties}
, _if_not_exists{if_not_exists}
{
if (!properties->has_property(cf_prop_defs::KW_COMPRESSION) && schema::DEFAULT_COMPRESSOR) {
std::map<sstring, sstring> compression = {
{ sstring(compression_parameters::SSTABLE_COMPRESSION), schema::DEFAULT_COMPRESSOR.value() },
};
properties->add_property(cf_prop_defs::KW_COMPRESSION, compression);
}
}
future<> create_table_statement::check_access(const service::client_state& state) {
@@ -156,12 +150,20 @@ void create_table_statement::add_column_metadata_from_aliases(schema_builder& bu
}
}
shared_ptr<prepared_statement>
create_table_statement::prepare(database& db, cql_stats& stats) {
// Cannot happen; create_table_statement is never instantiated as a raw statement
// (instead we instantiate create_table_statement::raw_statement)
abort();
}
create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name, bool if_not_exists)
: cf_statement{std::move(name)}
, _if_not_exists{if_not_exists}
{ }
::shared_ptr<prepared_statement> create_table_statement::raw_statement::prepare(database& db) {
::shared_ptr<prepared_statement> create_table_statement::raw_statement::prepare(database& db, cql_stats& stats) {
// Column family name
const sstring& cf_name = _cf_name->get_column_family();
std::regex name_regex("\\w+");
@@ -180,9 +182,9 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
throw exceptions::invalid_request_exception(sprint("Multiple definition of identifier %s", (*i)->text()));
}
properties->validate();
_properties.validate();
auto stmt = ::make_shared<create_table_statement>(_cf_name, properties, _if_not_exists, _static_columns);
auto stmt = ::make_shared<create_table_statement>(_cf_name, _properties.properties(), _if_not_exists, _static_columns);
std::experimental::optional<std::map<bytes, data_type>> defined_multi_cell_collections;
for (auto&& entry : _definitions) {
@@ -206,7 +208,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
throw exceptions::invalid_request_exception("Multiple PRIMARY KEYs specifed (exactly one required)");
}
stmt->_use_compact_storage = _use_compact_storage;
stmt->_use_compact_storage = _properties.use_compact_storage();
auto& key_aliases = _key_aliases[0];
std::vector<data_type> key_types;
@@ -225,7 +227,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
// Handle column aliases
if (_column_aliases.empty()) {
if (_use_compact_storage) {
if (_properties.use_compact_storage()) {
// There should remain some column definition since it is a non-composite "static" CF
if (stmt->_columns.empty()) {
throw exceptions::invalid_request_exception("No definition found that is not part of the PRIMARY KEY");
@@ -238,7 +240,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
} else {
// If we use compact storage and have only one alias, it is a
// standard "dynamic" CF, otherwise it's a composite
if (_use_compact_storage && _column_aliases.size() == 1) {
if (_properties.use_compact_storage() && _column_aliases.size() == 1) {
if (defined_multi_cell_collections) {
throw exceptions::invalid_request_exception("Collection types are not supported with COMPACT STORAGE");
}
@@ -266,7 +268,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
types.emplace_back(type);
}
if (_use_compact_storage) {
if (_properties.use_compact_storage()) {
if (defined_multi_cell_collections) {
throw exceptions::invalid_request_exception("Collection types are not supported with COMPACT STORAGE");
}
@@ -279,7 +281,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
if (!_static_columns.empty()) {
// Only CQL3 tables can have static columns
if (_use_compact_storage) {
if (_properties.use_compact_storage()) {
throw exceptions::invalid_request_exception("Static columns are not supported in COMPACT STORAGE tables");
}
// Static columns only make sense if we have at least one clustering column. Otherwise everything is static anyway
@@ -288,7 +290,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
}
}
if (_use_compact_storage && !stmt->_column_aliases.empty()) {
if (_properties.use_compact_storage() && !stmt->_column_aliases.empty()) {
if (stmt->_columns.empty()) {
#if 0
// The only value we'll insert will be the empty one, so the default validator don't matter
@@ -314,7 +316,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
} else {
// For compact, we are in the "static" case, so we need at least one column defined. For non-compact however, having
// just the PK is fine since we have CQL3 row marker.
if (_use_compact_storage && stmt->_columns.empty()) {
if (_properties.use_compact_storage() && stmt->_columns.empty()) {
throw exceptions::invalid_request_exception("COMPACT STORAGE with non-composite PRIMARY KEY require one column not part of the PRIMARY KEY, none given");
}
#if 0
@@ -327,18 +329,18 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
}
// If we give a clustering order, we must explicitly do so for all aliases and in the order of the PK
if (!_defined_ordering.empty()) {
if (_defined_ordering.size() > _column_aliases.size()) {
if (!_properties.defined_ordering().empty()) {
if (_properties.defined_ordering().size() > _column_aliases.size()) {
throw exceptions::invalid_request_exception("Only clustering key columns can be defined in CLUSTERING ORDER directive");
}
int i = 0;
for (auto& pair: _defined_ordering){
for (auto& pair: _properties.defined_ordering()){
auto& id = pair.first;
auto& c = _column_aliases.at(i);
if (!(*id == *c)) {
if (find_ordering_info(c)) {
if (_properties.find_ordering_info(c)) {
throw exceptions::invalid_request_exception(sprint("The order of columns in the CLUSTERING ORDER directive must be the one of the clustering key (%s must appear before %s)", c, id));
} else {
throw exceptions::invalid_request_exception(sprint("Missing CLUSTERING ORDER for column %s", c));
@@ -363,12 +365,7 @@ data_type create_table_statement::raw_statement::get_type_and_remove(column_map_
}
columns.erase(t);
auto is_reversed = find_ordering_info(t);
if (!is_reversed) {
return type;
} else {
return *is_reversed ? reversed_type_impl::get_instance(type) : type;
}
return _properties.get_reversable_type(t, type);
}
void create_table_statement::raw_statement::add_definition(::shared_ptr<column_identifier> def, ::shared_ptr<cql3_type::raw> type, bool is_static) {
@@ -387,14 +384,6 @@ void create_table_statement::raw_statement::add_column_alias(::shared_ptr<column
_column_aliases.emplace_back(alias);
}
void create_table_statement::raw_statement::set_ordering(::shared_ptr<column_identifier> alias, bool reversed) {
_defined_ordering.emplace_back(alias, reversed);
}
void create_table_statement::raw_statement::set_compact_storage() {
_use_compact_storage = true;
}
}
}

View File

@@ -43,6 +43,7 @@
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/cf_properties.hh"
#include "cql3/statements/raw/cf_statement.hh"
#include "cql3/cql3_type.hh"
@@ -103,6 +104,8 @@ public:
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
schema_ptr get_cf_meta_data();
class raw_statement;
@@ -123,30 +126,22 @@ private:
shared_ptr_value_hash<column_identifier>,
shared_ptr_equal_by_value<column_identifier>>;
defs_type _definitions;
public:
const ::shared_ptr<cf_prop_defs> properties = ::make_shared<cf_prop_defs>();
private:
std::vector<std::vector<::shared_ptr<column_identifier>>> _key_aliases;
std::vector<::shared_ptr<column_identifier>> _column_aliases;
std::vector<std::pair<::shared_ptr<column_identifier>, bool>> _defined_ordering; // Insertion ordering is important
std::experimental::optional<bool> find_ordering_info(::shared_ptr<column_identifier> type) {
for (auto& t: _defined_ordering) {
if (*(t.first) == *type) {
return t.second;
}
}
return {};
}
create_table_statement::column_set_type _static_columns;
bool _use_compact_storage = false;
std::multiset<::shared_ptr<column_identifier>,
indirect_less<::shared_ptr<column_identifier>, column_identifier::text_comparator>> _defined_names;
bool _if_not_exists;
cf_properties _properties;
public:
raw_statement(::shared_ptr<cf_name> name, bool if_not_exists);
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
cf_properties& properties() {
return _properties;
}
data_type get_type_and_remove(column_map_type& columns, ::shared_ptr<column_identifier> t);
@@ -155,10 +150,6 @@ public:
void add_key_aliases(const std::vector<::shared_ptr<column_identifier>> aliases);
void add_column_alias(::shared_ptr<column_identifier> alias);
void set_ordering(::shared_ptr<column_identifier> alias, bool reversed);
void set_compact_storage();
};
}

View File

@@ -38,6 +38,7 @@
*/
#include "cql3/statements/create_type_statement.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
@@ -155,6 +156,11 @@ future<bool> create_type_statement::announce_migration(distributed<service::stor
return service::get_local_migration_manager().announce_new_type(type, is_local_only).then([] { return true; });
}
shared_ptr<cql3::statements::prepared_statement>
create_type_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_type_statement>(*this));
}
}
}

View File

@@ -69,6 +69,8 @@ public:
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
static void check_for_duplicate_names(user_type type);
private:
bool type_exists_in(::keyspace& ks);

View File

@@ -0,0 +1,344 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <inttypes.h>
#include <regex>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/algorithm/adjacent_find.hpp>
#include "cql3/statements/create_view_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "schema_builder.hh"
#include "service/storage_proxy.hh"
namespace cql3 {
namespace statements {
create_view_statement::create_view_statement(
::shared_ptr<cf_name> view_name,
::shared_ptr<cf_name> base_name,
std::vector<::shared_ptr<selection::raw_selector>> select_clause,
std::vector<::shared_ptr<relation>> where_clause,
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys,
std::vector<::shared_ptr<cql3::column_identifier::raw>> clustering_keys,
bool if_not_exists)
: schema_altering_statement{view_name}
, _base_name{base_name}
, _select_clause{select_clause}
, _where_clause{where_clause}
, _partition_keys{partition_keys}
, _clustering_keys{clustering_keys}
, _if_not_exists{if_not_exists}
{
// TODO: probably need to create a "statement_restrictions" like select does
// based on the select_clause, base_name and where_clause; However need to
// pass for_view=true.
fail(unimplemented::cause::VIEWS);
}
// FIXME: I copied the following from create_table_statement. I don't know
// what they do or whether they need to change for create view.
future<> create_view_statement::check_access(const service::client_state& state) {
return state.has_keyspace_access(keyspace(), auth::permission::CREATE);
}
void create_view_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state) {
// validated in announceMigration()
}
future<bool> create_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) {
// FIXME: this code from create_table_view is probably wrong, the Java CreateViewStatement.announceMigration is much more elaborate
#if 0
****** Our implementation in creat_table_statement (simpler code but that was simpler also in Java)
return make_ready_future<>().then([this, is_local_only] {
return service::get_local_migration_manager().announce_new_column_family(get_cf_meta_data(), is_local_only);
}).then_wrapped([this] (auto&& f) {
try {
f.get();
return true;
} catch (const exceptions::already_exists_exception& e) {
if (_if_not_exists) {
return false;
}
throw e;
}
});
#endif
#if 0
***** This if 0 code is from Cassandra CreateViewStatement
// We need to make sure that:
// - primary key includes all columns in base table's primary key
// - make sure that the select statement does not have anything other than columns
// and their names match the base table's names
// - make sure that primary key does not include any collections
// - make sure there is no where clause in the select statement
// - make sure there is not currently a table or view
// - make sure baseTable gcGraceSeconds > 0
properties.validate();
if (properties.useCompactStorage)
throw new InvalidRequestException("Cannot use 'COMPACT STORAGE' when defining a materialized view");
#endif
// View and base tables must be in the same keyspace, to ensure that RF
// is the same (because we assign a view replica to each base replica).
// If a keyspace was not specified for the base table name, it is assumed
// it is in the same keyspace as the view table being created (which
// itself might be the current USEd keyspace, or explicitly specified).
if (_base_name->get_keyspace().empty()) {
_base_name->set_keyspace(keyspace(), true);
}
if (_base_name->get_keyspace() != keyspace()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot create a materialized view on a table in a separate keyspace ('%s' != '%s')",
_base_name->get_keyspace(), keyspace()));
}
// Validate that the keyspace and the base table exist, and is not a
// special table on which we cannot create views:
// CONTINUE HERE: something like the code below taken from migration_manager instead of the validateColumFamily below
auto& db = service::get_local_storage_proxy().get_db().local();
if (!db.has_keyspace(_base_name->get_keyspace())) {
throw exceptions::invalid_request_exception(sprint(
"Keyspace '%s' does not exist", _base_name->get_keyspace()));
}
// auto& ks = db.find_keyspace(_base_name->get_keyspace());
if (!db.has_schema(_base_name->get_keyspace(), _base_name->get_column_family())) {
throw exceptions::invalid_request_exception(sprint(
"Base table '%s' does not exist",
_base_name->get_column_family()));
}
return make_ready_future<bool>(true);
// if (db.has_schema(_base_name->get_keyspace(), _base_name->get_column_family())) {
// throw exceptions::already_exists_exception(cfm->ks_name(), cfm->cf_name());
// }
#if 0
CFMetaData cfm = ThriftValidation.validateColumnFamily(baseName.getKeyspace(), baseName.getColumnFamily());
if (cfm.isCounter())
throw new InvalidRequestException("Materialized views are not supported on counter tables");
if (cfm.isView())
throw new InvalidRequestException("Materialized views cannot be created against other materialized views");
if (cfm.params.gcGraceSeconds == 0)
{
throw new InvalidRequestException(String.format("Cannot create materialized view '%s' for base table " +
"'%s' with gc_grace_seconds of 0, since this value is " +
"used to TTL undelivered updates. Setting gc_grace_seconds" +
" too low might cause undelivered updates to expire " +
"before being replayed.", cfName.getColumnFamily(),
baseName.getColumnFamily()));
}
Set<ColumnIdentifier> included = new HashSet<>();
for (RawSelector selector : selectClause)
{
Selectable.Raw selectable = selector.selectable;
if (selectable instanceof Selectable.WithFieldSelection.Raw)
throw new InvalidRequestException("Cannot select out a part of type when defining a materialized view");
if (selectable instanceof Selectable.WithFunction.Raw)
throw new InvalidRequestException("Cannot use function when defining a materialized view");
if (selectable instanceof Selectable.WritetimeOrTTL.Raw)
throw new InvalidRequestException("Cannot use function when defining a materialized view");
ColumnIdentifier identifier = (ColumnIdentifier) selectable.prepare(cfm);
if (selector.alias != null)
throw new InvalidRequestException(String.format("Cannot alias column '%s' as '%s' when defining a materialized view", identifier.toString(), selector.alias.toString()));
ColumnDefinition cdef = cfm.getColumnDefinition(identifier);
if (cdef == null)
throw new InvalidRequestException("Unknown column name detected in CREATE MATERIALIZED VIEW statement : "+identifier);
included.add(identifier);
}
Set<ColumnIdentifier.Raw> targetPrimaryKeys = new HashSet<>();
for (ColumnIdentifier.Raw identifier : Iterables.concat(partitionKeys, clusteringKeys))
{
if (!targetPrimaryKeys.add(identifier))
throw new InvalidRequestException("Duplicate entry found in PRIMARY KEY: "+identifier);
ColumnDefinition cdef = cfm.getColumnDefinition(identifier.prepare(cfm));
if (cdef == null)
throw new InvalidRequestException("Unknown column name detected in CREATE MATERIALIZED VIEW statement : "+identifier);
if (cfm.getColumnDefinition(identifier.prepare(cfm)).type.isMultiCell())
throw new InvalidRequestException(String.format("Cannot use MultiCell column '%s' in PRIMARY KEY of materialized view", identifier));
if (cdef.isStatic())
throw new InvalidRequestException(String.format("Cannot use Static column '%s' in PRIMARY KEY of materialized view", identifier));
}
// build the select statement
Map<ColumnIdentifier.Raw, Boolean> orderings = Collections.emptyMap();
SelectStatement.Parameters parameters = new SelectStatement.Parameters(orderings, false, true, false);
SelectStatement.RawStatement rawSelect = new SelectStatement.RawStatement(baseName, parameters, selectClause, whereClause, null, null);
ClientState state = ClientState.forInternalCalls();
state.setKeyspace(keyspace());
rawSelect.prepareKeyspace(state);
rawSelect.setBoundVariables(getBoundVariables());
ParsedStatement.Prepared prepared = rawSelect.prepare(true);
SelectStatement select = (SelectStatement) prepared.statement;
StatementRestrictions restrictions = select.getRestrictions();
if (!prepared.boundNames.isEmpty())
throw new InvalidRequestException("Cannot use query parameters in CREATE MATERIALIZED VIEW statements");
if (!restrictions.nonPKRestrictedColumns(false).isEmpty())
{
throw new InvalidRequestException(String.format(
"Non-primary key columns cannot be restricted in the SELECT statement used for materialized view " +
"creation (got restrictions on: %s)",
restrictions.nonPKRestrictedColumns(false).stream().map(def -> def.name.toString()).collect(Collectors.joining(", "))));
}
String whereClauseText = View.relationsToWhereClause(whereClause.relations);
Set<ColumnIdentifier> basePrimaryKeyCols = new HashSet<>();
for (ColumnDefinition definition : Iterables.concat(cfm.partitionKeyColumns(), cfm.clusteringColumns()))
basePrimaryKeyCols.add(definition.name);
List<ColumnIdentifier> targetClusteringColumns = new ArrayList<>();
List<ColumnIdentifier> targetPartitionKeys = new ArrayList<>();
// This is only used as an intermediate state; this is to catch whether multiple non-PK columns are used
boolean hasNonPKColumn = false;
for (ColumnIdentifier.Raw raw : partitionKeys)
hasNonPKColumn |= getColumnIdentifier(cfm, basePrimaryKeyCols, hasNonPKColumn, raw, targetPartitionKeys, restrictions);
for (ColumnIdentifier.Raw raw : clusteringKeys)
hasNonPKColumn |= getColumnIdentifier(cfm, basePrimaryKeyCols, hasNonPKColumn, raw, targetClusteringColumns, restrictions);
// We need to include all of the primary key columns from the base table in order to make sure that we do not
// overwrite values in the view. We cannot support "collapsing" the base table into a smaller number of rows in
// the view because if we need to generate a tombstone, we have no way of knowing which value is currently being
// used in the view and whether or not to generate a tombstone. In order to not surprise our users, we require
// that they include all of the columns. We provide them with a list of all of the columns left to include.
boolean missingClusteringColumns = false;
StringBuilder columnNames = new StringBuilder();
List<ColumnIdentifier> includedColumns = new ArrayList<>();
for (ColumnDefinition def : cfm.allColumns())
{
ColumnIdentifier identifier = def.name;
boolean includeDef = included.isEmpty() || included.contains(identifier);
if (includeDef && def.isStatic())
{
throw new InvalidRequestException(String.format("Unable to include static column '%s' which would be included by Materialized View SELECT * statement", identifier));
}
if (includeDef && !targetClusteringColumns.contains(identifier) && !targetPartitionKeys.contains(identifier))
{
includedColumns.add(identifier);
}
if (!def.isPrimaryKeyColumn()) continue;
if (!targetClusteringColumns.contains(identifier) && !targetPartitionKeys.contains(identifier))
{
if (missingClusteringColumns)
columnNames.append(',');
else
missingClusteringColumns = true;
columnNames.append(identifier);
}
}
if (missingClusteringColumns)
throw new InvalidRequestException(String.format("Cannot create Materialized View %s without primary key columns from base %s (%s)",
columnFamily(), baseName.getColumnFamily(), columnNames.toString()));
if (targetPartitionKeys.isEmpty())
throw new InvalidRequestException("Must select at least a column for a Materialized View");
if (targetClusteringColumns.isEmpty())
throw new InvalidRequestException("No columns are defined for Materialized View other than primary key");
CFMetaData.Builder cfmBuilder = CFMetaData.Builder.createView(keyspace(), columnFamily());
add(cfm, targetPartitionKeys, cfmBuilder::addPartitionKey);
add(cfm, targetClusteringColumns, cfmBuilder::addClusteringColumn);
add(cfm, includedColumns, cfmBuilder::addRegularColumn);
cfmBuilder.withId(properties.properties.getId());
TableParams params = properties.properties.asNewTableParams();
CFMetaData viewCfm = cfmBuilder.build().params(params);
ViewDefinition definition = new ViewDefinition(keyspace(),
columnFamily(),
Schema.instance.getId(keyspace(), baseName.getColumnFamily()),
baseName.getColumnFamily(),
included.isEmpty(),
rawSelect,
whereClauseText,
viewCfm);
try
{
MigrationManager.announceNewView(definition, isLocalOnly);
return new Event.SchemaChange(Event.SchemaChange.Change.CREATED, Event.SchemaChange.Target.TABLE, keyspace(), columnFamily());
}
catch (AlreadyExistsException e)
{
if (ifNotExists)
return null;
throw e;
}
#endif
}
shared_ptr<transport::event::schema_change> create_view_statement::change_event() {
// FIXME: this is probably wrong, I just copied it from create_table_statement
return make_shared<transport::event::schema_change>(transport::event::schema_change::change_type::CREATED, transport::event::schema_change::target_type::TABLE, keyspace(), column_family());
}
shared_ptr<cql3::statements::prepared_statement>
create_view_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_view_statement>(*this));
}
}
}

View File

@@ -0,0 +1,79 @@
/*
* This file is part of Scylla.
* Copyright (C) 2016 ScyllaDB
*
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/cf_properties.hh"
#include "cql3/cql3_type.hh"
#include "cql3/selection/raw_selector.hh"
#include "cql3/relation.hh"
#include "cql3/cf_name.hh"
#include "service/migration_manager.hh"
#include "schema.hh"
#include "core/shared_ptr.hh"
#include <utility>
#include <vector>
#include <experimental/optional>
namespace cql3 {
namespace statements {
/** A <code>CREATE MATERIALIZED VIEW</code> parsed from a CQL query statement. */
class create_view_statement : public schema_altering_statement {
private:
::shared_ptr<cf_name> _base_name;
std::vector<::shared_ptr<selection::raw_selector>> _select_clause;
std::vector<::shared_ptr<relation>> _where_clause;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _clustering_keys;
cf_properties _properties;
bool _if_not_exists;
public:
create_view_statement(
::shared_ptr<cf_name> view_name,
::shared_ptr<cf_name> base_name,
std::vector<::shared_ptr<selection::raw_selector>> select_clause,
std::vector<::shared_ptr<relation>> where_clause,
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys,
std::vector<::shared_ptr<cql3::column_identifier::raw>> clustering_keys,
bool if_not_exists);
auto& properties() {
return _properties;
}
// Functions we need to override to subclass schema_altering_statement
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
// FIXME: continue here. See create_table_statement.hh and CreateViewStatement.java
};
}
}

View File

@@ -46,8 +46,8 @@ namespace cql3 {
namespace statements {
delete_statement::delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs)}
delete_statement::delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, cql_stats& stats)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs), &stats.deletes}
{ }
bool delete_statement::require_full_clustering_key() const {
@@ -80,10 +80,10 @@ void delete_statement::add_update_for_key(mutation& m, const exploded_clustering
namespace raw {
::shared_ptr<cql3::statements::modification_statement>
delete_statement::prepare_internal(database& db, schema_ptr schema, ::shared_ptr<variable_specifications> bound_names,
std::unique_ptr<attributes> attrs) {
delete_statement::prepare_internal(database& db, schema_ptr schema, shared_ptr<variable_specifications> bound_names,
std::unique_ptr<attributes> attrs, cql_stats& stats) {
using statement_type = cql3::statements::modification_statement::statement_type;
auto stmt = ::make_shared<cql3::statements::delete_statement>(statement_type::DELETE, bound_names->size(), schema, std::move(attrs));
auto stmt = ::make_shared<cql3::statements::delete_statement>(statement_type::DELETE, bound_names->size(), schema, std::move(attrs), stats);
for (auto&& deletion : _deletions) {
auto&& id = deletion->affected_column()->prepare_column_identifier(schema);

View File

@@ -56,7 +56,7 @@ namespace statements {
*/
class delete_statement : public modification_statement {
public:
delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs);
delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, cql_stats& stats);
virtual bool require_full_clustering_key() const override;

View File

@@ -40,6 +40,7 @@
*/
#include "cql3/statements/drop_keyspace_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "service/migration_manager.hh"
#include "transport/event.hh"
@@ -97,6 +98,11 @@ shared_ptr<transport::event::schema_change> drop_keyspace_statement::change_even
}
shared_ptr<cql3::statements::prepared_statement>
drop_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_keyspace_statement>(*this));
}
}
}

View File

@@ -62,6 +62,8 @@ public:
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -40,6 +40,7 @@
*/
#include "cql3/statements/drop_table_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "service/migration_manager.hh"
@@ -98,6 +99,11 @@ shared_ptr<transport::event::schema_change> drop_table_statement::change_event()
column_family());
}
shared_ptr<cql3::statements::prepared_statement>
drop_table_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_table_statement>(*this));
}
}
}

View File

@@ -61,6 +61,8 @@ public:
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -38,6 +38,7 @@
*/
#include "cql3/statements/drop_type_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "boost/range/adaptor/map.hpp"
@@ -116,6 +117,11 @@ future<bool> drop_type_statement::announce_migration(distributed<service::storag
return service::get_local_migration_manager().announce_type_drop(to_drop->second, is_local_only).then([] { return true; });
}
shared_ptr<cql3::statements::prepared_statement>
drop_type_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_type_statement>(*this));
}
}
}

View File

@@ -64,6 +64,8 @@ public:
virtual const sstring& keyspace() const override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -73,12 +73,13 @@ operator<<(std::ostream& out, modification_statement::statement_type t) {
return out;
}
modification_statement::modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_)
modification_statement::modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_, uint64_t* cql_stats_counter_ptr)
: type{type_}
, _bound_terms{bound_terms}
, s{schema_}
, attrs{std::move(attrs_)}
, _column_operations{}
, _cql_modification_counter_ptr(cql_stats_counter_ptr)
{ }
bool modification_statement::uses_function(const sstring& ks_name, const sstring& function_name) const {
@@ -447,10 +448,14 @@ modification_statement::execute(distributed<service::storage_proxy>& proxy, serv
throw exceptions::invalid_request_exception("Conditional updates are not supported by the protocol version in use. You need to upgrade to a driver using the native protocol v2.");
}
tracing::add_table_name(qs.get_trace_state(), keyspace(), column_family());
if (has_conditions()) {
return execute_with_condition(proxy, qs, options);
}
inc_cql_stats();
return execute_without_condition(proxy, qs, options).then([] {
return make_ready_future<::shared_ptr<transport::messages::result_message>>(
::shared_ptr<transport::messages::result_message>{});
@@ -508,6 +513,11 @@ modification_statement::execute_internal(distributed<service::storage_proxy>& pr
if (has_conditions()) {
throw exceptions::unsupported_operation_exception();
}
tracing::add_table_name(qs.get_trace_state(), keyspace(), column_family());
inc_cql_stats();
return get_mutations(proxy, options, true, options.get_timestamp(qs), qs.get_trace_state()).then(
[&proxy] (auto mutations) {
return proxy.local().mutate_locally(std::move(mutations));
@@ -568,20 +578,20 @@ modification_statement::process_where_clause(database& db, std::vector<relation_
namespace raw {
::shared_ptr<prepared_statement>
modification_statement::modification_statement::prepare(database& db) {
modification_statement::modification_statement::prepare(database& db, cql_stats& stats) {
auto bound_names = get_bound_variables();
auto statement = prepare(db, bound_names);
auto statement = prepare(db, bound_names, stats);
return ::make_shared<prepared>(std::move(statement), *bound_names);
}
::shared_ptr<cql3::statements::modification_statement>
modification_statement::prepare(database& db, ::shared_ptr<variable_specifications> bound_names) {
modification_statement::prepare(database& db, ::shared_ptr<variable_specifications> bound_names, cql_stats& stats) {
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
auto prepared_attributes = _attrs->prepare(db, keyspace(), column_family());
prepared_attributes->collect_marker_specification(bound_names);
::shared_ptr<cql3::statements::modification_statement> stmt = prepare_internal(db, schema, bound_names, std::move(prepared_attributes));
::shared_ptr<cql3::statements::modification_statement> stmt = prepare_internal(db, schema, bound_names, std::move(prepared_attributes), stats);
if (_if_not_exists || _if_exists || !_conditions.empty()) {
if (stmt->is_counter()) {

View File

@@ -109,8 +109,10 @@ private:
return cond->column;
};
uint64_t* _cql_modification_counter_ptr = nullptr;
public:
modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_);
modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_, uint64_t* cql_stats_counter_ptr);
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
@@ -152,6 +154,11 @@ public:
staticConditions == null ? Collections.<ColumnDefinition>emptyList() : Iterables.transform(staticConditions, getColumnForCondition));
}
#endif
void inc_cql_stats() {
++(*_cql_modification_counter_ptr);
}
public:
void add_condition(::shared_ptr<column_condition> cond);

View File

@@ -181,6 +181,21 @@ long property_definitions::to_long(sstring key, std::experimental::optional<sstr
}
}
void property_definitions::remove_from_map_if_exists(const sstring& name, const sstring& key)
{
auto it = _properties.find(name);
if (it == _properties.end()) {
return;
}
try {
auto map = boost::any_cast<std::map<sstring, sstring>>(it->second);
map.erase(key);
_properties[name] = map;
} catch (const boost::bad_any_cast& e) {
throw exceptions::syntax_exception(sprint("Invalid value for property '%s'. It should be a map.", name));
}
}
}
}

View File

@@ -79,6 +79,7 @@ protected:
std::experimental::optional<std::map<sstring, sstring>> get_map(const sstring& name) const;
void remove_from_map_if_exists(const sstring& name, const sstring& key);
public:
bool has_property(const sstring& name) const;

View File

@@ -84,7 +84,7 @@ public:
}
}
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -66,7 +66,7 @@ public:
bool if_exists);
protected:
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs);
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats);
};
}

View File

@@ -77,7 +77,7 @@ public:
bool if_not_exists);
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs) override;
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats) override;
};

View File

@@ -83,11 +83,11 @@ protected:
modification_statement(::shared_ptr<cf_name> name, ::shared_ptr<attributes::raw> attrs, conditions_vector conditions, bool if_not_exists, bool if_exists);
public:
virtual ::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<cql3::statements::modification_statement> prepare(database& db, ::shared_ptr<variable_specifications> bound_names);;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
::shared_ptr<cql3::statements::modification_statement> prepare(database& db, ::shared_ptr<variable_specifications> bound_names, cql_stats& stats);
protected:
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs) = 0;
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats) = 0;
};
}

View File

@@ -44,6 +44,7 @@
#include "cql3/variable_specifications.hh"
#include "cql3/column_specification.hh"
#include "cql3/column_identifier.hh"
#include "cql3/stats.hh"
#include <seastar/core/shared_ptr.hh>
@@ -70,7 +71,7 @@ public:
void set_bound_variables(const std::vector<::shared_ptr<column_identifier>>& bound_names);
virtual ::shared_ptr<prepared> prepare(database& db) = 0;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const;
};

View File

@@ -100,7 +100,7 @@ public:
std::vector<::shared_ptr<relation>> where_clause,
::shared_ptr<term::raw> limit);
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual ::shared_ptr<prepared> prepare(database& db,cql_stats& stats) override;
private:
::shared_ptr<restrictions::statement_restrictions> prepare_restrictions(
database& db,

View File

@@ -81,7 +81,7 @@ public:
conditions_vector conditions);
protected:
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs);
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats);
};
}

View File

@@ -58,7 +58,7 @@ private:
public:
use_statement(sstring keyspace);
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -86,11 +86,6 @@ void schema_altering_statement::prepare_keyspace(const service::client_state& st
}
}
::shared_ptr<prepared_statement> schema_altering_statement::prepare(database& db)
{
return ::make_shared<prepared>(this->shared_from_this());
}
future<::shared_ptr<messages::result_message>>
schema_altering_statement::execute0(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options, bool is_local_only) {
// If an IF [NOT] EXISTS clause was used, this may not result in an actual schema change. To avoid doing

Some files were not shown because too many files have changed in this diff Show More