Compare commits


581 Commits

Author SHA1 Message Date
Avi Kivity
bd807660e6 Update seastar submodule
* seastar 63079b7...725269f (1):
  > build: work around ragel 7 generated code bug
2017-08-09 14:22:28 +03:00
Avi Kivity
49894b2504 Update seastar submodule
* seastar 4a31f27...63079b7 (2):
  > build: export full cflags in pkgconfig file
  > build: disable -Wattributes when gcc -fvisibility=hidden bug strikes

Allows building with gcc 6.4.
2017-08-09 13:08:55 +03:00
Avi Kivity
79990908f4 Revert "sstable: close sstable_writer's file if writing of sstable fails."
This reverts commit 86cecd0d7d. Was already there as
3edcee4821.
2017-07-19 17:47:23 +03:00
Gleb Natapov
86cecd0d7d sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying file's object causes
crashes.

[tgrabiec: fixed typo]

Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
2017-07-19 17:28:42 +03:00
Raphael S. Carvalho
dd7052023b tests/sstable_test: fix compaction_manager_test
After 'compaction: make major compaction go through compaction manager',
the test fails because the task is preempted in debug mode before it
reaches the instruction that increases the stat.

Backport from 8dfb5f9c33
Included single-line change in compaction_manager_test from
8b0e358d73 that fixes release mode too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170714194946.12467-1-raphaelsc@scylladb.com>
2017-07-15 08:05:35 +03:00
Pekka Enberg
f8b81c2565 release: prepare for 1.6.6 2017-07-14 12:12:32 +03:00
Duarte Nunes
cb55afd824 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162400.15968-1-duarte@scylladb.com>
2017-07-14 10:12:01 +02:00
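The arrangement described above can be sketched as follows; this is a hypothetical Python sketch of the idea, not Scylla's actual C++ code, and the names (arrange_by_replica, replica_order) are invented for illustration:

```python
# Arrange per-partition replica responses so that each column of the
# resulting matrix always holds mutations from the same replica.
def arrange_by_replica(replica_order, responses_per_partition):
    # responses_per_partition: one dict per partition key, mapping
    # replica -> mutation (a replica that sent nothing is simply absent).
    # Each row is a partition key; missing responses become None, so
    # reconciliation code scanning a column sees only one replica's data.
    return [[responses.get(replica) for replica in replica_order]
            for responses in responses_per_partition]

rows = arrange_by_replica(
    ["replica_a", "replica_b"],
    [{"replica_a": "m1", "replica_b": "m2"},  # partition key 1
     {"replica_b": "m3"}])                    # partition key 2: a sent nothing
```

Without a fixed replica order, a column could mix mutations from different replicas, which is the invariant the reconciliation code relies on.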
Raphael S. Carvalho
d8e86faf95 compaction: make major compaction go through compaction manager
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce the disk space requirement.
Each column family will be running either minor or major compaction
at a given time. The only issue is the number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of sstables considerably. If this turns out to be
an issue, we can allow minor compaction to start in parallel with major,
but not the other way around.

Fixes #1156.

NOTE: resolved a conflict because in this version the manager doesn't
stop transportation services for I/O errors.

(cherry picked from commit 3286f7aaa6)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170614211923.9955-2-raphaelsc@scylladb.com>
2017-06-15 12:42:48 +03:00
Raphael S. Carvalho
159fa22082 database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and each one discards only an
insignificant portion of its data.
Given that cleanup is frequently issued when a node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.

Fixes #192.

(cherry picked from commit 7deeffc953)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170614211923.9955-1-raphaelsc@scylladb.com>
2017-06-15 12:42:40 +03:00
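The disk-space argument above can be sketched with a back-of-the-envelope model; the numbers are hypothetical and this models only the rewrite overhead, not the actual compaction_manager code:

```python
# While an sstable is being rewritten during cleanup, old and new copies
# coexist on disk, so peak extra space depends on how many rewrites run
# at once.
def peak_extra_disk(sstable_sizes, serialized):
    if serialized:
        # one rewrite in flight per shard at any moment
        return max(sstable_sizes)
    # all rewrites in flight at once
    return sum(sstable_sizes)

sizes_mib = [100, 80, 120]  # hypothetical per-sstable sizes, MiB
```

With these assumed sizes, parallel cleanup transiently needs an extra 300 MiB, while serialized cleanup peaks at 120 MiB.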
Calle Wilund
8ef146d6ca commitlog_test: Fix test_commitlog_delete_when_over_disk_limit
Test should
a.) Wait for the flush semaphore
b.) Only compare segment sets between start and end, not start,
    end and in between. I.e. the test sort of assumed we started
    with < 2 (or so) segments. Not always the case (timing)

Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 0c598e5645)
2017-06-13 19:53:32 +03:00
Paweł Dziepak
1b9e0766b3 commitlog: avoid copying column_mapping
It is safe to copy column_mapping across shards, but that guarantee
comes at the cost of performance.

This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.

Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
          before       after      diff
write   79434.35    89247.54    +12.3%

(cherry picked from commit 374c8a56ac)

Also: Fixes #2468.
2017-06-11 15:56:48 +03:00
Paweł Dziepak
67f7932a35 idl: fix generated writers when member functions are used
When using a member name in an identifier of a generated class or
method, the idl compiler should strip the trailing '()'.

(cherry picked from commit 4df4994b71)

(part of #2468)
2017-06-11 15:56:45 +03:00
Paweł Dziepak
991d6bdb77 idl: add start_frame() overload for seastar::simple_output_stream
(cherry picked from commit 018d16d315)

(part of #2468)
2017-06-11 15:56:43 +03:00
Pekka Enberg
9699319194 release: prepare for 1.6.5 2017-06-09 07:53:34 +03:00
Raphael S. Carvalho
788b7bfb46 db: fix computation of live disk usage stat after compaction
sstable::data_size() is used by rebuild_statistics(), but it only
returns the uncompressed data size, while the function called by it
expects the actual disk space used by all components.
Boot uses add_sstable(), which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
rebuild_statistics() too.

Fixes #1592

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
(cherry picked from commit 3b5ad23532)
2017-05-28 11:32:01 +03:00
Raphael S. Carvalho
535e9ff95f db: remove partial sstable created by memtable flush which failed
Partial sstable files aren't being removed after each failed attempt
to flush a memtable, which happens periodically. If the cause of the
failure is ENOSPC, the memtable flush will be retried forever, and
as a result, a column family may be left with a huge number of partial
files which will overwhelm a subsequent boot when removing temporary
TOCs. In the past, this led to OOM because removal of temporary TOCs
took place in parallel.

Fixes #2407.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
(cherry picked from commit b7e1575ad4)
2017-05-25 11:50:32 +03:00
Asias He
1648301411 repair: Fix partition estimation
We estimate the number of partitions for a given range of a column family
and split the range into sub-ranges containing fewer partitions as a
checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
(cherry picked from commit 66e3b73b9c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

Conflicts:
	repair/repair.cc

Glauber: conflict is due to the fact that dht::token_range is still called nonwrapping_range
Message-Id: <20170522015356.4236-1-glauber@scylladb.com>
2017-05-22 08:31:43 +03:00
Pekka Enberg
43a62ab5e1 Revert "repair: Fix partition estimation"
This reverts commit 54db1b5452. Glauber writes:

 "I forgot one git ammend, as it seems. The token type had to be changed
  in two places for it to compile, and I ended up including just one. I
  will send another fixed version soon."
2017-05-22 08:31:12 +03:00
Tomasz Grabiec
42a4fbe528 row_cache: Fix undefined behavior in read_wide()
_underlying is created with _range, which is captured by
reference. But range_and_underlying_reader is moved after being
constructed by do_with(), so _range reference is invalidated.

Fixes #2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0351ab8bc6)
2017-05-21 19:14:23 +03:00
Gleb Natapov
38e9683753 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Removing
a lot of sstables in parallel may start a lot of threads, each of which
takes 128k for its stack. There is not much benefit in running the
deletion in parallel anyway, so fix it by deleting sstables sequentially.

Fixes #2384.

Message-Id: <20170517111722.GY3874@scylladb.com>
2017-05-21 17:12:19 +03:00
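The memory argument above can be modeled roughly (this is not Scylla code; it only illustrates why sequential deletion bounds peak stack memory):

```python
# Each deletion runs in a thread with a 128 KiB stack, so peak stack
# memory is proportional to how many deletions run concurrently.
STACK_KIB = 128

def peak_stack_kib(n_sstables, sequential):
    concurrent = 1 if sequential else n_sstables
    return concurrent * STACK_KIB
```

For 1000 temporary sstables, parallel removal would need ~125 MiB of stacks at once, while sequential removal needs a single 128 KiB stack.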
Asias He
54db1b5452 repair: Fix partition estimation
We estimate the number of partitions for a given range of a column family
and split the range into sub-ranges containing fewer partitions as a
checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
(cherry picked from commit 66e3b73b9c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

Conflicts:
	repair/repair.cc

Glauber: conflict is due to the fact that dht::token_range is still called nonwrapping_range
2017-05-21 17:04:09 +03:00
Tomasz Grabiec
f8f1d1d300 Update seastar submodule head
* dist/ami/files/scylla-ami d5a4397...407e8f3 (4):
  > scylla_create_devices: check block device is exists
  > Rewrite disk discovery to handle EBS and NVMEs.
  > add --developer-mode option
  > trivial cleanup: replace tab in indent

* seastar 5810d50...4a31f27 (6):
  > tls: make shutdown/close do "clean" handshake shutdown in background
  > tls: Make sink/source (i.e. streams) first class channel owners
  > tls.cc: Fix shutdown_input/output to conform with expected socket behaviour
  > net/*: qualify net::* refs better -> ::net::* etc
  > native-stack: Make sink/source (i.e. streams) first class channel owners
  > posix-stack: Make sink/source (i.e. streams) first class channel owners
2017-05-19 14:19:40 +02:00
Pekka Enberg
c257850e25 Merge "gossip mark alive fixes" from Asias
"This series fixes the user after free issue in gossip and elimates the
duplicated / unnecessary mark alive operations.

Fixes #2341"

* tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev:
  gossip: Ignore callbacks and mark alive operation in shadow round
  gossip: Ingore the duplicated mark alive operation
  gossip: Fix user after free in mark_alive

(cherry picked from commit 1e04731fa0)
2017-05-09 01:57:46 +03:00
Pekka Enberg
e8e5c88a02 Update seastar submodule
* seastar 8e54a9b...5810d50 (1):
  > scripts: posix_net_conf.sh: get rid of funny syntax

Fixes #2269
2017-05-03 10:34:50 +03:00
Avi Kivity
4f4edec8e1 Merge "Fix problems with slicing using sstable's promoted index" from Tomasz
"Fixes #2327.
Fixes #2326."

* 'tgrabiec/fix-promoted-index-parsing-1.7' of github.com:cloudius-systems/seastar-dev:
  sstables: Fix incorrect parsing of cell names in promoted index
  sstables: Fix find_disk_ranges() to not miss relevant range tombstones

(cherry picked from commit ea0591ad3d)
2017-04-30 17:04:21 +03:00
Tomasz Grabiec
65862c4221 sstables: Fix usage of wrong comparator in find_disk_ranges()
This made a difference if clustering restriction bounds were not full
keys but prefixes.

Fixes #2272.

Message-Id: <1493058357-24156-1-git-send-email-tgrabiec@scylladb.com>
2017-04-24 21:56:33 +03:00
Pekka Enberg
7672c59873 release: prepare for 1.6.4 2017-04-24 19:39:09 +03:00
Duarte Nunes
19212a059f alter_type_statement: Fix signed to unsigned conversion
This could allow us to alter a non-existing field of a UDT.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170419114254.5582-1-duarte@scylladb.com>
(cherry picked from commit e06bafdc6c)
2017-04-19 14:48:38 +03:00
Raphael S. Carvalho
bbbb4dffbd partitioned_sstable_set: fix quadratic space complexity
Streaming generates lots of small sstables with a large token range,
which triggers O(N^2) space usage in the interval map.
Level-0 sstables will now be stored in a structure that has O(N)
space complexity and which will be included in every read.

Fixes #2287.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
(cherry picked from commit 11b74050a1)
2017-04-18 13:14:13 +03:00
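A simplified sketch of the data-structure split described above (the class and method names here are invented for illustration; the real partitioned_sstable_set is C++ and token-based):

```python
class PartitionedSstableSet:
    """Level-0 sstables live in a flat O(N)-space list that every read
    consults; higher levels keep (first, last) token bounds so a read
    only touches sstables whose range overlaps the queried token."""
    def __init__(self):
        self.level0 = []
        self.leveled = []  # (first_token, last_token, name)

    def insert(self, first, last, name, level):
        if level == 0:
            self.level0.append(name)
        else:
            self.leveled.append((first, last, name))

    def select(self, token):
        hits = list(self.level0)
        hits += [name for first, last, name in self.leveled
                 if first <= token <= last]
        return hits

sset = PartitionedSstableSet()
sset.insert(0, 50, "l0-a", level=0)
sset.insert(10, 20, "l1-a", level=1)
sset.insert(60, 90, "l1-b", level=1)
```

The trade-off: level-0 sstables cost every read a lookup, but storing them in a flat list avoids the interval map's quadratic blow-up for many small, wide-range sstables.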
Asias He
2f0970e83c gossip: Fix possible use-after-free of entry in endpoint_state_map
We take a reference to an endpoint_state entry in endpoint_state_map and
access it again after code which defers; the reference can be invalid
after the defer if someone deletes the entry during the defer.

Fix this by taking the reference again after the deferring code.

I also audited the code to remove unsafe references to endpoint_state_map
entries as much as possible.

Fixes the following SIGSEGV:

Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout
0 --default-log-level info --'.
Program terminated with signal SIGSEGV, Segmentation fault.
(this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127
127     in /usr/include/c++/5/bits/stl_pair.h
[Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))]

Fixes #2271

Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>
(cherry picked from commit d27b47595b)
2017-04-13 13:18:54 +03:00
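The fix pattern can be illustrated with a Python analogue (the real bug is a C++ reference into endpoint_state_map invalidated across a preemption point; mark_alive and the map shape here are simplified stand-ins):

```python
def mark_alive(endpoint_state_map, addr, deferring_code):
    state = endpoint_state_map.get(addr)
    if state is None:
        return None
    deferring_code()  # may yield; someone may delete map entries meanwhile
    # Take the reference again after the deferring code, instead of
    # reusing the possibly-invalidated one.
    state = endpoint_state_map.get(addr)
    if state is None:
        return None   # entry was removed during the defer
    state["alive"] = True
    return state
```

If the deferring code removes the entry, the re-lookup observes that and bails out instead of touching freed state.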
Pekka Enberg
5ab8601f64 release: prepare for 1.6.3 2017-04-04 14:10:48 +03:00
Takuya ASADA
993d216f23 dist/debian/debian/scylla-server.upstart: export SCYLLA_CONF, SCYLLA_HOME
We are sourcing the sysconfig file on upstart, but forgot to load its
values as environment variables.
So export them.

Fixes #2236

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491209505-32293-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit b087616a6c)
2017-04-04 11:00:17 +03:00
Calle Wilund
630425f513 messaging_service: Move log printout to actual listen start
Fixes #1845
The log printout happened before we had actually evaluated the endpoint
to create, and thus never included the SSL info.
Message-Id: <1487766738-27797-1-git-send-email-calle@scylladb.com>

(cherry picked from commit d5f57bd047)
(cherry picked from commit 8c0488bce9)
2017-03-30 18:29:32 +03:00
Calle Wilund
1626c9ce5e commitlog/replayer: Bugfix: minimum rp broken, and cl reader offset too
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as an sstable at all, we must tag it as zero-rp, and
make the whole shard for it start at the same zero).

This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.

Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.

I.e. three fixes:
* Reinstate min position guarding for unencountered CF:s
* Fix stream creation in the CL reader
* Fix segment map iterator use.

v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>

(cherry picked from commit b12b65db92)
2017-03-28 11:16:17 +02:00
Calle Wilund
bc6def4a0f commitlog_replayer: Do proper const-lookup of min positions for shards
Fixes #2173

Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.

Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.

Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
(cherry picked from commit c3a510a08d)
2017-03-22 18:46:17 +02:00
Vlad Zolotarov
1e77bcd60f Don't report a Tracing session ID unless the current query had a Tracing bit in its flags
Although the current master's behaviour is legal, it's suboptimal, and some clients are sensitive to that.
Let's fix that.

Fixes #2179

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490115157-4657-1-git-send-email-vladz@scylladb.com>
2017-03-22 14:56:20 +02:00
Pekka Enberg
f9bcedfbf7 dist/docker: Expose Prometheus port by default
This patch exposes Scylla's Prometheus port by default. You can now use
the Scylla Monitoring project with the Docker image:

  https://github.com/scylladb/scylla-grafana-monitoring

To configure the IP addresses, use the 'docker inspect' command to
determine Scylla's IP address (assuming your running container is called
'some-scylla'):

  docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla

and then use that IP address in the prometheus/scylla_servers.yml
configuration file.

Fixes #1827

Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 85a127bc78)
2017-03-20 15:30:28 +02:00
Calle Wilund
c814cb3433 commitlog_replayer: Make replay parallel per shard
Fixes #2098

Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.

v2:
* Fixed whitespace errors

Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 078589c508)
2017-03-20 12:50:43 +02:00
Amos Kong
62143d26b4 scylla_setup: match '-p' option of lsblk with strict pattern
On Ubuntu 14.04, lsblk doesn't have the '-p' option, but
`scylla_setup` tries to get the block list with `lsblk -pnr` and
triggers an error.

The current simple pattern will match anywhere in the help content, so it
might match the wrong options.
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p
   -m, --perms          output info about permissions
   -P, --pairs          use key="value" output format

Let's use strict pattern to only match option at the head. Example:
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D'
   -D, --discard        print discard capabilities

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>
(cherry picked from commit 468df7dd5f)
2017-03-20 11:16:21 +02:00
Amnon Heiman
7a1ceadec4 storage_proxy: metrics should have unique name
Metrics should have a unique name. This patch renames the queue-length
metric throttled_writes to current_throttled_writes.

Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.

This could be related to scylladb/seastar#250

Fixes #2163.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
(cherry picked from commit 295a981c61)
2017-03-19 18:11:58 +02:00
Pekka Enberg
d41ba3ba78 release: prepare for 1.6.2 2017-03-10 11:24:19 +02:00
Asias He
0e11c6218a repair: Fix midpoint is not contained in the split range assertion in split_and_add
We have:

  auto halves = range.split(midpoint, dht::token_comparator());

We saw a case where midpoint == range.start; as a result, range.split
will assert because range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range - hence the
assertion failure.

Fixes #2148

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
(cherry picked from commit 39d2e59e7e)
2017-03-09 10:38:54 +02:00
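The failure mode can be reproduced with a toy range type (a hypothetical Python class, not dht::token_range): with a non-inclusive start bound, a midpoint equal to the start is not contain()ed, so split() must not be attempted there.

```python
class Range:
    def __init__(self, start, end, start_inclusive=False):
        self.start, self.end = start, end
        self.start_inclusive = start_inclusive

    def contains(self, token):
        if self.start_inclusive:
            above = token >= self.start
        else:
            above = token > self.start  # the start itself is excluded
        return above and token <= self.end

r = Range(10, 20, start_inclusive=False)
```

Here a split at token 10 would fail the containment check, which is exactly the case split_and_add hit when midpoint == range.start.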
Paweł Dziepak
a096a173bf Merge "Avoid loosing changes to keyspace parameters of system_auth and tracing keyspaces" form Tomek
"If a node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.

To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.

Fixes #2129."

* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
  db: Create default auth and tracing keyspaces using lowest timestamp
  migration_manager: Append actual keyspace mutations with schema notifications

(cherry picked from commit 6db6d25f66)
2017-03-08 20:37:10 +01:00
Nadav Har'El
13d52a8172 sstable decompression: fix skip() to end of file
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.

Fixes #2143

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
(cherry picked from commit 506e074ba4)
2017-03-08 12:35:47 +02:00
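The corrected control flow can be sketched like this (a hypothetical reader class; the real code is the compressed-file input stream in C++):

```python
class CompressedReader:
    def __init__(self, uncompressed_len, chunk_starts):
        self.len = uncompressed_len
        self.pos = 0                  # logical (uncompressed) position
        self.chunks = chunk_starts    # start offset of each on-disk chunk

    def locate(self, pos):
        # Only valid for actual byte positions, never one-past-end.
        assert pos < self.len
        return max(start for start in self.chunks if start <= pos)

    def skip(self, n):
        self.pos += n
        if self.pos < self.len:
            return self.locate(self.pos)  # find the covering chunk
        return None  # at EOF: only the file pointer is updated

reader = CompressedReader(100, [0, 64])
```

Skipping to position 100 (one past the last byte of a 100-byte file) returns without calling locate(), which would otherwise fail because no chunk covers that offset.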
Gleb Natapov
14b8a49b3f memtable: do not open code logalloc::reclaim_lock use
logalloc::reclaim_lock prevents reclaim from running, which may cause a
regular allocation to fail although there is enough free memory.
To solve that there is allocation_section, which acquires reclaim_lock
and, if the allocation fails, runs the reclaimer outside of the lock and
retries the allocation. The patch makes use of allocation_section instead of
direct use of reclaim_lock in the memtable code.

Fixes #2138.

Message-Id: <20170306160050.GC5902@scylladb.com>
(cherry picked from commit d7bdf16a16)
2017-03-07 11:16:47 +02:00
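The allocation_section pattern described above can be sketched as follows (hypothetical helper signatures and a tiny simulated allocator, not the logalloc API):

```python
import threading

def allocation_section(alloc, reclaim, lock, attempts=3):
    """Try to allocate with reclaim locked out; on failure, release the
    lock, run the reclaimer, and retry the allocation."""
    for _ in range(attempts):
        lock.acquire()         # reclaim cannot run while we allocate
        try:
            return alloc()
        except MemoryError:
            pass
        finally:
            lock.release()
        reclaim()              # run the reclaimer outside the lock
    raise MemoryError

class _Env:  # simulated allocator: first attempt fails, reclaim frees space
    def __init__(self):
        self.free_units = 0
        self.calls = 0
    def alloc(self):
        self.calls += 1
        if self.free_units < 1:
            raise MemoryError
        return "buffer"
    def reclaim(self):
        self.free_units += 1

env = _Env()
result = allocation_section(env.alloc, env.reclaim, threading.Lock())
```

The key point is that the reclaimer runs only after the lock is released, so reclaim and allocation never race while still letting a failed allocation recover.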
Gleb Natapov
e437edafe6 memtable: do not yield while holding reclaim_lock
Holding reclaim_lock while yielding may cause memory allocations to
fail.

Fixes #2139

Message-Id: <20170306153151.GA5902@scylladb.com>
(cherry picked from commit 5c4158daac)
2017-03-06 18:36:20 +02:00
Takuya ASADA
b91456da90 dist/debian/dep: fix broken link of gcc-5, update it to 5.4.1-5
Since gcc-5/stretch=5.4.1-2 was removed from the apt repository, we are no
longer able to build gcc-5.

To avoid the dead link, use launchpad.net archives instead of apt-get source.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ba323e2074)
2017-03-06 16:32:22 +02:00
Tomasz Grabiec
ebb404275d db: Fix overflow of gc_clock time point
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.

I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.

Fixes the following UBSAN report:

  /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'

Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 4b6e77e97e)
2017-03-01 18:50:42 +02:00
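The arithmetic in question, with a clamping fix sketched in Python for illustration (the int32 limits come from the UBSAN report; gc_clock is a C++ clock with 32-bit signed seconds, and saturating_sub is an invented name, not the actual patch):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def saturating_sub(query_time_s, gc_grace_s):
    # In 32-bit signed arithmetic, INT32_MIN - 604800 would overflow
    # (exactly the UBSAN report above); clamp to the representable range.
    return max(query_time_s - gc_grace_s, INT32_MIN)
```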
Tomasz Grabiec
92e6c62c1c query: Fix invalid initialization of _memory_tracker by moving-from-self
Fixes the following UBSAN warning:

  core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment

Since the field was not initialized properly, this probably also fixes some
user-visible bug.
Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0c84f00b16)
2017-03-01 11:57:21 +00:00
Gleb Natapov
3edcee4821 sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying file's object causes
crashes.

[tgrabiec: fixed typo]

Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
2017-02-28 11:10:17 +02:00
Avi Kivity
9c148d6e4c Update seastar submodule
* seastar ee84851...8e54a9b (1):
  > fix append_challenged_posix_file_impl::process_queue() to handle recursion

Fixes #2121.
2017-02-28 10:57:22 +02:00
Shlomi Livne
2ea4da28e8 release: prepare for 1.6.1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-02-16 22:37:07 +02:00
Takuya ASADA
a04722aeb2 dist/redhat: stop backporting ninja-build from Fedora, install it from EPEL instead
ninja-build-1.6.0-2.fc23.src.rpm was deleted from the Fedora web site for some
reason, but there is ninja-build-1.7.2-2 on EPEL, so we don't need to
backport from Fedora anymore.

Fixes #2087

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 9c8515eeed)
2017-02-15 13:01:19 +02:00
Avi Kivity
41b8dc3b84 Update seastar submodule
* seastar 0bfd7fe...ee84851 (1):
  > prometheus: send one MetricFamily per unique metric name

Fixes #2077.
Fixes #2078.
2017-02-13 16:12:26 +02:00
Avi Kivity
b6ebe2e20b Merge "Avoid avalanche of tasks after memtable flush" from Tomasz
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.

However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may start many request releasing
threads due to group size updates.

Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()

(cherry picked from commit 7a00dd6985)
2017-02-02 22:19:25 +01:00
Pekka Enberg
7e1b245887 release: prepare for 1.6.0 2017-02-01 13:58:06 +02:00
Takuya ASADA
83fc7de65f dist: add lspci to dependencies, since it is used by dpdk-devbind.py
On a minimum setup environment, scylla_sysconfig_setup will fail because the lspci command is not installed, so install it at package installation time.

Fixes #2035

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485327435-20543-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bce0fb3fa2)
2017-01-25 17:37:18 +02:00
Pekka Enberg
d4781f2de3 release: prepare for 1.6.rc2 2017-01-24 14:33:12 +02:00
Amos Kong
2193a83a82 dist/redhat: fix path of housekeeping.cfg
scylla-housekeeping[3857]: Config file /etc/scylla.d/housekeeping.cfg is missing, terminating

Housekeeping failed to execute for missing the config file,
the config file should be in /etc/scylla.d/.

Fixes #2020

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <e63f2f8cb94410a6dca4e6193932f0079755ad47.1484724328.git.amos@scylladb.com>
(cherry picked from commit b880bdccef)
2017-01-19 11:09:41 +02:00
Pekka Enberg
78d74bf23a dist/docker: Use Scylla 1.6 RPM repository 2017-01-18 11:58:42 +02:00
Pekka Enberg
642a479c73 release: prepare for 1.6.rc1 2017-01-16 19:02:05 +02:00
Tomasz Grabiec
8a6d0ad2fa storage_proxy: Fix capturing of on-stack variable by reference
partition_range_count was accepted by do_with callback by value and
then captured by reference by async code, thus invoking use after
destroy.

Message-Id: <1484317846-14485-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3c3a4358ae)
2017-01-16 11:49:34 +02:00
Tomasz Grabiec
37f73781ee storage_proxy: Add missing initialization of _short_read_allowed
Dropped by a1cafed370 ("storage_proxy:
handle range scans of sparsely populated tables").

Fixes the failure in update_cluster_layout_tests.TestUpdateClusterLayout test.

Message-Id: <1484317450-13525-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 66547e7d7c)
2017-01-13 16:49:30 +02:00
Takuya ASADA
6fd5442fb7 scylla-housekeeping: move uuid file to /var/lib/scylla-housekeeping
Since scylla-housekeeping runs as the scylla user, it doesn't have permission
to create a file in /etc/scylla.d.
So introduce /var/lib/scylla-housekeeping, which is owned by the scylla user,
and place the uuid file in that directory.

Fixes #2009

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484235946-12463-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bee7f549a9)
2017-01-13 16:28:36 +02:00
Avi Kivity
e2777e508c Update seastar submodule
* seastar 2909f6c...0bfd7fe (1):
  > io_queue: remove owner number from metric name
2017-01-12 16:46:25 +02:00
Tomasz Grabiec
8cf7bbf208 storage_proxy: Fix use-after-free on one_or_two_partition_ranges
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until the
operation resolves. However, we were passing an expiring value to it, the
result of unwrap().

Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test

Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.

Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1e8151b4f2)
2017-01-12 15:12:06 +02:00
Vlad Zolotarov
4b5742d3a6 tracing::trace_keyspace_helper: use generate_legacy_id() for CF IDs generation
Explicitly generate the IDs of tables from the system_traces KS using
generate_legacy_id() in order to ensure all nodes create these tables with
the same IDs.

This is going to prevent hitting issue #420.

Fixes #1976

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1484153725-31030-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit ca0a0f1458)
2017-01-12 11:36:52 +02:00
Avi Kivity
cf27d44412 config: disable new sharding algorithm
It still has problems:
 - while resharding a very large leveled compaction strategy table, a huge
   amount of tiny sstables are generated, overwhelming the file descriptor
   limits
 - there is a large impact on read latency while resharding is going on
2017-01-12 10:15:51 +02:00
Avi Kivity
7b40e19561 Update seastar submodule
* seastar 0b49f28...2909f6c (2):
  > file/dup: don't decrease refcnt twice when file is explicitly closed
  > file: add dup() support

Preparing for file descriptor reduction during resharding backport.
2017-01-10 16:59:26 +02:00
Duarte Nunes
210e66b2b8 query_pagers: Fix over-counting of rows
This patch fixes a regression introduced in 0518895, where we counted
one extra row per partition when it contained live, non-static rows.

We also simplify the visitor logic further, since now we don't need to
count rows one by one. Also remove a bunch of unused fields.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482234083-2447-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit d7e607ff51)
2017-01-10 16:54:09 +02:00
Amnon Heiman
6af51c1b1d scylla_setup: remove the uuid file creation
Scylla housekeeping can create a uuid file if it is missing. There is no
longer a need to create one for it.

Fixes #2004

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-3-git-send-email-amnon@scylladb.com>
(cherry picked from commit 8cd3d7445c)
2017-01-09 17:00:49 +02:00
Amnon Heiman
05f2bf5bd5 scylla-housekeeping: Create a uuid file if one is missing
This patch gets housekeeping to create a uuid file if a path to a uuid
file is supplied but the file is missing.

Because it imports the uuid lib, uuid parameters were renamed.

Fixes #1987

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-2-git-send-email-amnon@scylladb.com>
(cherry picked from commit 32888fc0aa)
2017-01-09 15:27:04 +02:00
Avi Kivity
8002326f80 storage_proxy: prevent short read due to buffer size limit from being swallowed during range scan
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.

Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.

Fixes #2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>

(cherry picked from commit 8f36dca6f1)
2017-01-09 09:27:56 +00:00
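The fixed invariant can be sketched in a few lines; this is a toy model with invented names, not the actual mutation_result_merger code:

```cpp
#include <cassert>
#include <vector>

// Toy sketch of the invariant fixed above: the merged result is a short
// read if any partial result was short *or* the merger's own memory
// accounter stopped the scan early.
bool merged_short_read(const std::vector<bool>& partial_short_reads,
                       bool accounter_stopped_early) {
    // The bug was initializing this to `false`, ignoring the accounter.
    bool short_read = accounter_stopped_early;
    for (bool partial : partial_short_reads) {
        short_read = short_read || partial;
    }
    return short_read;
}
```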
Tomasz Grabiec
a00a1a1044 db: Make system tables use the commitlog
Before this patch system table writes were not writing to commit log
because database::add_column_family() disables writes to commit log
for the table which is added if _commitlog is not set at that
time. Fix by initializing commit log before system tables are created.

Fixes #1986.

Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.

Could cause system keyspace writes to be delayed for more than before
under heavy write workload. Refs #1926.

Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit cd630fece6)
2017-01-05 14:54:11 +02:00
Avi Kivity
337d6fb2cf storage_proxy: fix result ordering for parallel partition range scans
During a range scan, we try to avoid sorting according to partition range
when we can do so.  This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.

However, we use the wrong key for the sort -- we use the shard number.  But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.

Fix by storing the iteration order as the sort key.  We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.

Fixes #1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>

(cherry picked from commit eb520e7352)
2017-01-05 12:56:33 +01:00
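A toy model of the sort-key change (invented types; the real code sorts partial query results, not ints): with a wrapped shard walk such as 2, 3, 0, 1, sorting by shard number would move shard 0's range ahead of shard 2's even though it was scanned later, so the iteration order is stored and used as the key instead.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct partial_result {
    unsigned shard;           // shard that produced this result
    unsigned iteration_order; // position in the scan; the correct sort key
    int data;                 // stand-in for the actual rows
};

std::vector<int> merge_in_scan_order(std::vector<partial_result> parts) {
    // Sorting by `iteration_order` preserves ring order even when the
    // scan started at a shard s > 0 and wrapped around to shard 0.
    std::sort(parts.begin(), parts.end(), [](const auto& a, const auto& b) {
        return a.iteration_order < b.iteration_order;
    });
    std::vector<int> merged;
    for (const auto& p : parts) {
        merged.push_back(p.data);
    }
    return merged;
}
```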
Avi Kivity
1a8211e573 result_memory_tracker: fix too-short short reads
1.6 truncates paged queries early to avoid overrunning server memory
with too-large query results, but in the case of partition range queries,
this terminates too early due to an uninitialized variable holding the
maximum result size.  This results in slow performance due to additional
round trips.

Fix by initializing the maximum result size from the result_memory_tracker
running on the coordinating shard.

Fixes #1995.
Message-Id: <20170105103915.10633-1-avi@scylladb.com>

(cherry picked from commit 4667641f5f)
2017-01-05 11:04:03 +00:00
Takuya ASADA
a1a6c10964 dist/redhat: add python-setuptools as a dependency since it is required by scylla-housekeeping
scylla-housekeeping breaks when python-setuptools isn't installed, so
add it as a dependency.

Fixes #1884

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483525828-7507-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 43655512e1)
2017-01-04 14:32:52 +02:00
Avi Kivity
c692824786 Update seastar submodule
* seastar 1e45fc8...0b49f28 (2):
  > metrics: Metrics function should take variable as a refernce
  > collectd: create metrics with the right format
2017-01-04 12:41:33 +02:00
Gleb Natapov
0ccdbbf1af storage_proxy: do not deref unengaged stdx:optional
Fixes intentional short reads.

Message-Id: <20161227142133.GE1829@scylladb.com>
(cherry picked from commit 4ca58959ad)
2017-01-01 12:16:45 +02:00
Takuya ASADA
138ad64cbc dist/ubuntu: check lsb_release existence since it's not included in a minimal Debian installation
Ubuntu has it in a minimal installation but Debian doesn't, so check that it exists.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483003565-2753-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit e48cc9cf01)
2016-12-29 16:40:20 +02:00
Pekka Enberg
01737a51a9 tracing: Add seastar/core/scollectd.hh include
Fix the following build breakage:

FAILED: build/release/gen/cql3/CqlParser.o
g++ -MMD -MT build/release/gen/cql3/CqlParser.o -MF build/release/gen/cql3/CqlParser.o.d -std=gnu++1y -g  -Wall -Werror -fvisibility=hidden -pthread -I/home/penberg/scylla/seastar -I/home/penberg/scylla/seastar/fmt -I/home/penberg/scylla/seastar/build/release/gen  -march=nehalem -Ifmt -DBOOST_TEST_DYN_LINK -Wno-overloaded-virtual -DFMT_HEADER_ONLY -DHAVE_HWLOC -DHAVE_NUMA -DHAVE_LZ4_COMPRESS_DEFAULT  -O2 -DBOOST_TEST_DYN_LINK  -Wno-maybe-uninitialized -DHAVE_LIBSYSTEMD=1 -I. -I build/release/gen -I seastar -I seastar/build/release/gen -c -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
In file included from ./query-request.hh:31:0,
                 from ./locator/token_metadata.hh:51,
                 from ./locator/abstract_replication_strategy.hh:29,
                 from ./database.hh:26,
                 from ./service/storage_proxy.hh:44,
                 from ./db/schema_tables.hh:43,
                 from ./db/system_keyspace.hh:46,
                 from ./cql3/functions/function_name.hh:45,
                 from ./cql3/selection/selectable.hh:48,
                 from ./cql3/selection/writetime_or_ttl.hh:45,
                 from build/release/gen/cql3/CqlParser.hpp:63,
                 from build/release/gen/cql3/CqlParser.cpp:44:
./tracing/tracing.hh:357:5: error: ‘scollectd’ does not name a type
     scollectd::registrations _registrations;
     ^~~~~~~~~

Message-Id: <1482939751-8756-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit a443dfa95e)
2016-12-28 19:16:30 +02:00
Pekka Enberg
9753a39284 Update seastar submodule
* seastar 0b98024...1e45fc8 (1):
  > Merge "migrate network related seastar collectd metrics to the new metrics registration API" from Vlad
2016-12-28 17:06:06 +02:00
Avi Kivity
42a76567b7 dht: use nonwrapping_ranges in ring_position_range_sharder
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges.  Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>

(cherry picked from commit 8686a59ea5)
2016-12-27 19:16:26 +02:00
Avi Kivity
725949e8bf Merge "Fixes for intentional short reads" from Paweł
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.

In order to minimise chances of a digest mismatch during data queries, replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.

It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."

* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
  query: introduce result_memory_accounter::foreign_state
  storage_proxy: fix short reads in parallel range queries
  storage_proxy: pass maximum result size to replicas
  mutation_partition: use result limiter for digest reads
  query: make result_memory_limiter constants available for linker
  result_memory_limiter: add accounter for digest reads
  idl: allow writers to use any output stream
  result_memory_limiter: split new_read() to new_{data, mutation}_read()
  idl: is_short_read() was added in 1.6
  mutation_partition: honour allowed_short_read for static rows
  storage_proxy: fix _is_short_read computation
  storage_proxy: disallow short reads if got no live rows
  storage_proxy: don't stop after result with no live rows

(cherry picked from commit 868b4d110c)
2016-12-27 19:16:15 +02:00
Amnon Heiman
231cf22c0e Set the prometheus prefix to scylla
This patch makes the prometheus prefix configurable and sets the default
value to scylla.

Fixes #1964

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>
(cherry picked from commit 70b2a1bfd4)
2016-12-27 17:57:08 +02:00
Takuya ASADA
ef0ffd1cbb dist/common/scripts/scylla_setup: improve the message of disk selection prompt
To avoid confusing users, state that we only list unmounted disks.

Fixes #1841

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479720708-6021-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 7c3b98806d)
2016-12-27 17:56:11 +02:00
Gleb Natapov
2b67c65eb6 messaging_service: move MUTATION_DONE messages to separate connection
If a node gets more MUTATION requests than it can handle via RPC it will
stop reading from this RPC connection, but this will prevent it from
getting MUTATION_DONE responses for requests it coordinates, because
currently MUTATION and MUTATION_DONE messages share the same connection.

To solve this problem this patch moves MUTATION_DONE messages to a
separate connection.

Fixes: #1843

Message-Id: <20161201155942.GC11581@scylladb.com>
(cherry picked from commit 0a2dd39c75)
2016-12-27 17:55:01 +02:00
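The idea reduces to a verb-to-connection mapping; a hypothetical sketch with invented names, not the actual messaging_service API:

```cpp
#include <cassert>

// Route response-like verbs to a different connection index than
// throttleable request verbs, so back-pressure on MUTATION reads can
// never starve MUTATION_DONE delivery.
enum class verb { mutation, mutation_done };

unsigned connection_index(verb v) {
    switch (v) {
    case verb::mutation:      return 0; // may stop being read under overload
    case verb::mutation_done: return 1; // must always make progress
    }
    return 0;
}
```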
Piotr Jastrzebski
5b0971f82f mutation_partition: don't use unique_ptr to manage LSA objects
std::unique_ptr won't destroy them correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>
(cherry picked from commit 3e502de153)
2016-12-27 17:54:45 +02:00
Raphael S. Carvalho
da843239bf sstables: fix calculation of memory footprint for summary
The size of keys wasn't taken into account, so the value reported
via collectd was much smaller than the actual footprint.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>
(cherry picked from commit e28537b56f)
2016-12-27 17:54:35 +02:00
Avi Kivity
5c96b04f4d Revert "config, dht: reduce default msb ignore bits to 4"
This reverts commit b81a57e8eb.

With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.

(cherry picked from commit 3989e4ed15)
2016-12-27 17:54:09 +02:00
Avi Kivity
a1d463900f storage_proxy: handle range scans of sparsely populated tables
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard.  With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.

Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).

If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatenation.  This
optimizes for the dense(r) case, where few partial results are required.

If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys.  This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.

We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.

[tgrabiec: trivial conflicts]

Message-Id: <20161220170532.25173-1-avi@scylladb.com>
(cherry picked from commit a1cafed370)
2016-12-27 16:57:18 +02:00
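The latency claim follows from the doubling schedule; a minimal sketch of just the round counting (assumed structure, not the actual storage_proxy code):

```cpp
#include <cassert>
#include <cstddef>

// Query one subrange in the first round, then double the number of
// concurrently queried subranges on each retry. For a sparse or empty
// table, where every round may come back empty, the number of rounds is
// O(log n) instead of O(n).
size_t rounds_needed(size_t n_subranges) {
    size_t queried = 0;
    size_t batch = 1;
    size_t rounds = 0;
    while (queried < n_subranges) {
        // Querying `batch` subranges in parallel is omitted here.
        queried += batch;
        batch *= 2; // exponential growth bounds the round count
        ++rounds;
    }
    return rounds;
}
```

For instance, with 1024 subranges (as a large murmur3_partitioner_ignore_msb_bits setting can produce) only 11 rounds are needed.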
Avi Kivity
66598a68d5 tests: adjust mutation_query_test for partition and row limits
Won't build otherwise.

(cherry picked from commit b740aff777)
2016-12-27 16:53:35 +02:00
Avi Kivity
8ad0e96025 Merge "storage_proxy: Enforce row limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. Uppers layers like the pager may
already be depending on this behavior."

* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
  query_pagers: Don't trim returned rows
  select_statement: Don't always trim result set
  query_result_merger: Limit rows
  mutation_query: to_data_query_result enforces row limit

(cherry picked from commit 3421ebe8be)
2016-12-27 16:36:41 +02:00
Avi Kivity
1a2a63787a Merge "storage_proxy: Enforce partition limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."

* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
  storage_proxy: Decrease limits when retrying command
  storage_proxy: Don't fetch superfluous partitions
  query::result: Add partition count
  column_family: Use counters in query::result::builder
  query_result_builder: Use the underlying counters
  mutation_partition: Count partitions in query_compacted
  mutation_partition: Remove tabs in query_compacted
  query::result::builder: Add partition count
  query_result_merger: Limit partitions

(cherry picked from commit 6bb875bdb7)
2016-12-27 16:36:27 +02:00
Avi Kivity
52e5706147 Point seastar submodule at scylla-seastar.git
Allow separate management of Scylla 1.6's version of seastar.
2016-12-27 16:33:22 +02:00
Pekka Enberg
cec07ea366 release: prepare for 1.6.rc0 2016-12-27 12:41:34 +02:00
Benoît Canet
cbe729415e scylla_setup: Use blkid or ls to list potentials block devices
blkid does not list root raw device.

Revert to lsblk while taking care of having a fallback
path in case the -p option is not supported.

Fixes #1963.

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20161225100204.13297-1-benoit@scylladb.com>
(cherry picked from commit a24ff47c63)
2016-12-27 12:34:56 +02:00
Raphael S. Carvalho
b5190f9971 db: avoid excessive disk usage during sstable resharding
Shared sstables will now be resharded in the same order to guarantee
that all shards owning an sstable will agree on its deletion at nearly
the same time, thereby reducing the disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.

Fixes #1952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
(cherry picked from commit 27fb8ec512)
2016-12-27 12:19:25 +02:00
Tomasz Grabiec
7739456ec2 sstables: Fix double close on index and data files when writing fails
File output streams take responsibility for closing the file: they
will close the file as part of closing the stream.

During sstable writing we create sstable object and keep file
references there as well. Sstable object also has responsibility for
closing the files, and does so from sstable::~sstable().

Double close was supposed to be avoided by a construct like this:

  writer.close().get();
  _file = {};

However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.

Another problem is that if an exception happened before we reached that
construct, we should still close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed and once that's fixed double close on _file will be even more
likely.

The fix employed here is to not keep files inside sstable object when
writing. As soon as the writer is constructed, it's the only owner of
the file.

Fixes #1764.

Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f2a63270d1)
2016-12-27 11:11:51 +02:00
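The ownership rule adopted here can be modeled in a few lines (toy types with invented names; the real code transfers a seastar file into the output stream):

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct file_handle {
    int* close_count; // counts close() calls so a double close is detectable
    explicit file_handle(int* c) : close_count(c) {}
    void close() { ++*close_count; }
};

// Once the writer is constructed, it is the file's only owner; the
// sstable-like object keeps no second reference that could close the
// file again from its destructor.
struct stream_writer {
    std::unique_ptr<file_handle> f;
    explicit stream_writer(std::unique_ptr<file_handle> fh) : f(std::move(fh)) {}
    // Idempotent: a repeated close() is a no-op instead of a second close.
    void close() {
        if (f) {
            f->close();
            f.reset();
        }
    }
};
```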
Takuya ASADA
e4da0167d4 dist/redhat: don't try to adduser when user is already exists
Currently we get "failed adding user 'scylla'" on .rpm installation when user is already exists, we can skip it to prevent error.

Fixes #1958

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1482550075-27939-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit f3e45bc9ef)
2016-12-27 09:47:52 +02:00
Vlad Zolotarov
a3266060e3 tracing: don't start tracing until a Tracing service is fully initialized
RPC messaging service is initialized before the Tracing service, so
we should prevent creation of tracing spans before the service is
fully initialized.

We will use an already existing "_down" state and extend it in a way
that !_down equals "started", where "started" is TRUE when the local
service is fully initialized.

We will also split the Tracing service initialization into two parts:
   1) Initialize the sharded object.
   2) Start the tracing service:
      - Create the I/O backend service.
      - Enable tracing.

Fixes issue #1939

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 62cad0f5f5)
2016-12-21 12:49:41 +02:00
Glauber Costa
f6c83f73ef track streaming and system virtual dirty memory
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.

After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
(cherry picked from commit 7133583797)
2016-12-21 12:44:30 +02:00
Avi Kivity
293876c72f Merge "Limit number of readers streaming uses" from Paweł
"The original, naive db::make_streaming_reader() implementation created a set
of memtable and sstable readers for every partition range. This caused
a bad interaction with the code limiting sstable reader concurrency and
was suboptimal.

This series introduces multi range mutation reader that takes mutation
source and a sorted, disjoint vector of ranges. It creates only a single
set of memtable and sstable readers and fast forwards it to the next
range once the current one is completed."

* 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev:
  db: use multi range reader for streaming readers
  dht: describe split_range[s]_to_shards() guarantees
  repair: remove outdated fixme
  test/mutation_reader_test: add multi_range_reader test
  tests/mutation_reader: extract key creation code
  mutation_reader: add multi_range_reader
2016-12-15 17:48:31 +02:00
Paweł Dziepak
cf679a413c db: use multi range reader for streaming readers
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.

The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
2016-12-15 13:54:43 +00:00
Paweł Dziepak
b86a826baf dht: describe split_range[s]_to_shards() guarantees
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
2016-12-15 13:07:32 +00:00
Paweł Dziepak
5287417136 repair: remove outdated fixme 2016-12-15 13:07:32 +00:00
Paweł Dziepak
5b0cf20f75 test/mutation_reader_test: add multi_range_reader test 2016-12-15 13:07:32 +00:00
Paweł Dziepak
787a976c2b tests/mutation_reader: extract key creation code 2016-12-15 13:07:32 +00:00
Paweł Dziepak
52a4e79210 mutation_reader: add multi_range_reader
So far, the only way to combine outputs of multiple readers was to use
combining reader. It is very general and, in particular, supports the case
where the readers emit mutations from overlapping ranges.

However, we have cases (e.g. streaming) when we need to read from
several disjoint ranges. Combining reader is a suboptimal solution as it
requires creating a reader for each range and ignores the fact that
they do not overlap.

This patch introduces multi_range_mutation_reader which takes a
mutation_source and a sorted set of disjoint ranges. Internally, it uses
mutation_reader::fast_forward_to() to move to the next range once the
current one is completed.
2016-12-15 13:07:31 +00:00
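A toy model of the single-reader-plus-fast-forward approach (invented types standing in for mutation_reader and dht ranges):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One underlying reader over a sorted key stream; fast_forward_to()
// skips ahead instead of constructing a new reader per range.
struct toy_reader {
    std::vector<int> keys; // sorted keys in the mutation source
    size_t pos = 0;
    void fast_forward_to(int begin) {
        while (pos < keys.size() && keys[pos] < begin) ++pos;
    }
    bool next(int end, int& out) {
        if (pos < keys.size() && keys[pos] < end) {
            out = keys[pos++];
            return true;
        }
        return false; // current range exhausted
    }
};

// Ranges must be sorted and disjoint, mirroring the requirement this
// series places on make_streaming_reader()'s input.
std::vector<int> read_ranges(toy_reader r,
                             const std::vector<std::pair<int, int>>& ranges) {
    std::vector<int> out;
    for (auto [b, e] : ranges) {
        r.fast_forward_to(b);
        for (int k; r.next(e, k);) out.push_back(k);
    }
    return out;
}
```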
Pekka Enberg
06c5216c9d Merge "Improve gossip feature logging" from Asias 2016-12-15 10:36:54 +02:00
Asias He
e578e65103 gossip: Log feature enabled message on shard zero only
A feature is per node. There is no need to log it once per shard.
2016-12-15 16:33:11 +08:00
Asias He
4137fab91b gossip: Make log in check_features debug level
We saw the message twice for the same feature check. This is a bit
confusing.

INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}

This is because

   ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE);
   ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE);

The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE)
is constructed. The second message is printed when the
ss._range_tombstones_feature is copy-constructed.
2016-12-15 16:33:10 +08:00
Asias He
2b1ebc4719 gossip: Introduce gms:features::enable helper
Add a helper function to enable a feature and log that the feature is
enabled.

When a feature is enabled, we see

INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled
INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled

in the log.
2016-12-15 16:33:10 +08:00
Paweł Dziepak
b70e5d2089 Merge seastar upstream
Submodule seastar 6fbd792..0b98024:
  > fstream: fix read ahead byte metric types
  > fstream: add read-ahead metrics
  > future-util: make stop_iteration use bool_class<>
  > util: introduce bool_class<Tag>
2016-12-14 15:01:13 +00:00
Avi Kivity
57f4910832 Merge "Query result size limiting" from Paweł
"This series makes Scylla limit size of query results it produces in case they
grow unreasonably large. This is possible because CQL paging queries do not
guarantee that the returned page is going to have page_size rows and pages
smaller than tha *do not* indicate end of stream. Non-paged queries and Thrift
requests do not have such flexibility and they also get all the requested data
(though their memory usage is still accounted for and may limit paged queries).

There is a maximum result size (1 MB) and all results builders will stop after
reaching it. Moreover, there is a per-shard limitation on the amount of memory
used by all results combined (10%). To avoid tiny results a query has to
reserve (wait if necessary) 4 kB before starting executing, after that it can
consume more memory without any additional waiting provided it is below
individual and shard-local limits.

Enabling the cluster to return less rows than requested also means some changes
for the coordinator. Firstly, if it receives such short result from a replica
retrying it with a larger limit obviously makes no sense whatsoever. Instead,
in such cases the coordinator removes the clustering rows it has incomplete
information about and sends a short result back to the client. Moreover, even
if no replica returned a short response, reconciliation may have made it so. In
this case, the coordinator does not necessarily need to retry the query either.
Unfortunately, with the current implementation short responses ruin data
queries since they will cause a digest mismatch.

Three new metrics were added:
 * database_bytes_total_result_memory -- total memory used by query results
 * database_total_operations_short_data_queries -- data queries that were
   limited by size, particulary bad as it basically forces coordinator to
   retry them as mutation queries
 * database_total_operations_short_mutation_queries -- mutation queries limited
   by size"

* 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: clean up after primary_key introduction
  cql3: allow short reads with paged queries
  storage_proxy: handle intentional short reads
  storage_proxy: make sure coordinator has complete data
  storage_proxy: honour partition limit
  storage_proxy: use cmd limits to determine that replica reached end
  db: add metrics for short reads and memory used for results
  data_query: limit result size
  mutation_query: limit result size
  db: create result_memory_accounters when starting query
  query_builder: add partition_slice getter
  reconcilable_result: keep result_memory_tracker object
  mutation_compactor: honour stop_iteration from consumers
  db: add result_memory_limiter
  query: add result size limiter
  reconcilable_result: properly propagate short_read flag
  query_pagers: handle short reads properly
  query: allow short reads
  serializer_impl: add serializer for bool_class<Tag>
2016-12-14 16:53:07 +02:00
Paweł Dziepak
4c69d7e2fe storage_proxy: clean up after primary_key introduction
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify the patch introducing it, its fields were named 'first' and
'second'. This patch changes the names to something more meaningful,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dde4bd5051 cql3: allow short reads with paged queries
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
3c173d87b5 storage_proxy: handle intentional short reads
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dd67de7218 storage_proxy: make sure coordinator has complete data
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).

However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
2ff5308d8e storage_proxy: honour partition limit
At the moment the coordinator does not care much for the partition
limit. In particular it doesn't check whether after reconciliation the
result still contains enough partitions.

This patch makes it honour the partition limit and increase it in the
retried queries if necessary.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
7bed7aa7de storage_proxy: use cmd limits to determine that replica reached end
Coordinator may retry a query with larger limits. However, the code
determining whether a replica has no more data always used the original
limits. This may cause a livelock.

For example, consider cluster having the following partitions (deletions
cover live cells):

node1:
pk=0, v=0
pk=1, v=1

node2
delete pk=0
delete pk=1
pk=2, v=2
pk=3, v=3

Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is
going to send partitions 0 and 1 while the second node is going to send 2
and 3 + tombstones for 0 and 1. The coordinator will decide that it
needs to retry the request with larger row limit since node1 may have
some information about partitions 2 and 3 that are newer than what node2
has sent.

However, when the second response arrives node1 will still send only two
rows since it has no more data. Because the coordinator uses the original
row limit it will not notice that this node reached the end and we are
going to get another retry without making any progress.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
cfd4d0f680 db: add metrics for short reads and memory used for results
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
ba51e7e8db data_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
f1b9f49f2b mutation_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
6c33a4f177 db: create result_memory_accounters when starting query
This patch ensures that when we start executing a query, a minimum result
size is reserved from result_memory_limiter.

Moreover, range queries need a way of merging memory usage information
from different shards.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
0bce4047bd query_builder: add partition_slice getter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
15de8de9e5 reconcilable_result: keep result_memory_tracker object
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
34f9eb4cbd mutation_compactor: honour stop_iteration from consumers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
5d7185fd39 db: add result_memory_limiter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
ee89d80d5c query: add result size limiter
This patch introduces an infrastructure for limiting result size.

There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an individual limit which restricts a result to 4 MB.

In order to avoid sending tiny results there is a minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
43fe3439ca reconcilable_result: properly propagate short_read flag
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
837d24f1b2 query_pagers: handle short reads properly
Currently, the paging implementation assumes that the server returns
either as many rows as it was asked for or has reached the end. Soon,
that's not going to be true, so instead of making any assumptions about
the number of rows returned, use the new "short read" flag to
determine whether there is going to be more data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
da7ca85040 query: allow short reads
When paging is used the cluster is allowed to return fewer rows than the
client asked for. However, if that possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to send and short reads caused by any other means.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Paweł Dziepak
7a15c89b1d serializer_impl: add serializer for bool_class<Tag>
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Takuya ASADA
8918a4be57 dist/common/scripts/scylla_setup: don't abort scylla_setup when each setup script failed
Instead of aborting scylla_setup, print a warning message and continue with the next setup step.

Fixes #1357

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481713664-18429-1-git-send-email-syuu@scylladb.com>
2016-12-14 13:31:50 +02:00
Tomasz Grabiec
c9344826e9 tests: Remove unintentional enablement of trace-level logging
Sneaked in by mistake.
2016-12-14 10:58:07 +01:00
Tomasz Grabiec
fe6a70dba1 tests: commitlog: Fix assumption about write visibility
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and a new segment may be
active while the previous one is still being written or not yet
synced.

Fix the test so that it expects that the number of mutations read
this way is <= the number of mutations added, and that after all
segments are synced, the two numbers are equal.

Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
2016-12-14 11:29:33 +02:00
Avi Kivity
a61ff53150 Merge "rework flush criteria" from Glauber
"The current criteria for memtable flush is not being respected.  The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.

More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.

To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.

Fixes #1918"

* 'flush-criteria-v6' of github.com:glommer/scylla:
  config: get rid of memtable_total_space
  database: rework dirty memory hierarchy
  system keyspace: write batchlog mutation in user memory
  database: remove flush_token
  database: abstract pressure condition notification
  database: encapsulate semaphore_units into a flush_permit
  database: remove friendship declaration
  database: simplify flush_one
  database: make memtable_list aware in cases it can't flush
2016-12-14 11:24:10 +02:00
Takuya ASADA
c18a95cddf dist/redhat: add scylla_lib.sh to scylla.spec
Fix .rpm build error.

Fixes #1932

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481703992-9596-1-git-send-email-syuu@scylladb.com>
2016-12-14 10:27:37 +02:00
Glauber Costa
56df53f51e compaction_manager: fix shutdown sequence
By the time we are able to acquire this semaphore, we may be stopped
already. So we need to test it before we go ahead. I can see shutdown
hangs before this patch that are fixed with it applied.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <e5b378893128d086d584ffbb2acd3fb687648e5c.1481655433.git.glauber@scylladb.com>
2016-12-14 09:26:24 +01:00
Glauber Costa
2aa6514667 config: get rid of memtable_total_space
Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 17:05:12 -05:00
Glauber Costa
80440c0d79 database: rework dirty memory hierarchy
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.

That happens because group sizes (and limits) are inherited through the
hierarchy for pressure purposes, and the soft threshold is currently at
40 %. This causes the system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.

This is a direct consequence of the linear hierarchy between the regions,
and to guarantee that it won't happen we would have to acquire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems of its own, like priority inversion if
the regions have different priorities - like streaming and regular - and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors.

To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happens we will draw some memory from the cache, but we will
live with it for the time being.

Fixes #1918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 14:07:53 -05:00
Glauber Costa
db7cc3cba8 system keyspace: write batchlog mutation in user memory
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:35 -05:00
Glauber Costa
be9e4c71ad database: remove flush_token
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
98030ad66c database: abstract pressure condition notification
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
c9a8b03311 database: encapsulate semaphore_units into a flush_permit
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.

Preparation patch for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
2e8c7d2c62 database: remove friendship declaration
Not needed anymore since memtable started having a direct pointer to the
memtable list.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
bb1509c21e database: simplify flush_one
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different from the one the flush request originated in.

It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
8ab7c04caa database: make memtable_list aware in cases it can't flush
Some of our CFs can't be flushed. Those are the ones that are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.

We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.

It's easier, simpler, and cleaner to just make the memtable_list aware
that it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function, and then we
make sure that it is initialized to an empty std::function.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Takuya ASADA
0a6312d254 dist/common/scripts/scylla_ntp_setup: fix incorrect usage of is_debian_variant
Use it as "if is_debian_variant; then".
Fixes #1931

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481644262-29383-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:42 +02:00
Takuya ASADA
ed4cd1908f dist/common/scripts/scylla_selinux_setup: correct CentOS/RHEL detection
CentOS/RHEL is using SELinux, and it's NOT a Debian variant, so fixed from
"is_debian_variant" to "! is_debian_variant".

Fixes #1930

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643873-28984-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:29 +02:00
Takuya ASADA
6c0dc55495 dist/common/scripts/scylla_selinux_setup: to use is_debian_variant(), need to source /usr/lib/scylla/scylla_lib.sh
This fixes the following command-not-found error:
```
/usr/sbin/scylla_selinux_setup: line 7: is_debian_variant: command not found
```

Fixes #1929

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643308-28637-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:13 +02:00
Takuya ASADA
3b74c50546 dist/ubuntu: add uuidgen to package dependency
We haven't added uuidgen to the Ubuntu/Debian package dependencies, so the
scylla_setup script may abort with a command-not-found error.

Fixes #1928

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481642385-27941-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:28:48 +02:00
Duarte Nunes
1e75a4950e database: Complete query when hitting partition limit
Currently, we weren't completing a query as early as possible when it
reached the partition limit; instead we had to wait until reaching the
end of the specified partition ranges. This patch fixes that by
including a check of the partition limit in the termination condition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>

Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
2016-12-13 14:53:46 +02:00
Tomasz Grabiec
f451014785 schema: Implement operator<< for column_mapping
Message-Id: <1481310679-14074-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:20:46 +02:00
Tomasz Grabiec
059a1a4f22 db: Fix commitlog replay to not drop cell mutations with older schema
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestations of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.

During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.

Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.

Fixes #1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:19:32 +02:00
Avi Kivity
32d55bbb4c Merge seastar upstream
* seastar 0773e98...6fbd792 (2):
  > tls: Only run our "verify" function in client session
  > Merge "Clean the metric definition" from Amnon

Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
2016-12-13 12:17:14 +02:00
Avi Kivity
6f9c317b91 Merge "Use uuid file in housekeeping" from Amnon
"This patch adds the use of uuid file to the housekeeping daily version check.
uuid file are optional, if a file is missing no uuid will be used."
2016-12-13 10:52:44 +02:00
Avi Kivity
c67782f169 Merge seastar upstream
* seastar 0a74317...0773e98 (6):
  > tls: Add support for client cetrificate verification & priority strings
  > semaphore: add consume_units
  > semaphore: add available_units()
  > thread: check need_preempt for threads in a scheduling group as well
  > tutorial: fix semaphore example, and text
  > stop_iteration: add && and || operators
2016-12-12 18:06:19 +02:00
Avi Kivity
c801cc4bd1 Merge "streaming and repair updates" from Asias
"This series:
- We can make reader with ranges
- Fix possible use after free of 'si'
- Streaming ranges now are sorted and merged
- Fix shard_begin shard_end end loop in both streaming and repair"
2016-12-12 11:32:42 +02:00
Asias He
ba54654af3 streaming: Use interval_set to sort and merge ranges
So that the ranges are sorted and have no overlaps. We then have fewer
ranges to deal with, and it can help the mutation readers optimize.

Here is an example of ranges generated by repair:

Before:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    before ranges = {(-3383928698815274642, -3376937163195039606],
    (-7260764223708720005, -7251657821052234309], (-4767213984179237293,
    -4747032371925842389], (-7645879646119667643, -7589962743703481776],
    (-2340199306656526861, -2320523117224780931], (-576028861239229331,
    -560973674020019962], (-4070378863644120252, -3987599893827407860],
    (-2551584407739673151, -2498779102482524711], (-5416061903556353312,
    -5354212455975869358], (37594980457713898, 67885601051654285],
    (3083778975065200884, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (5765437476661317163, 5778671293583720541],
    (5960610072466058818, 5972289771228014343], (7749618183851698485,
    7758080813117351135], (-3987599893827407860, -3899198931034439776],
    (-7251657821052234309, -7131649010279865221], (-3576581915808403133,
    -3383928698815274642], (-417850207760366422, -327959672080599465],
    (-2671876682129336880, -2551584407739673151], (-1305178847032904465,
    -1137497074548854552], (8540448858050275827, 8610171849752115483],
    (-560973674020019962, -417850207760366422], (-2498779102482524711,
    -2340199306656526861], (2394447940525988167, 2523396860109747637],
    (-6703329224557608009, -6517757811218772762], (-3675103288021821677,
    -3576581915808403133], (-5622185785296846551, -5416061903556353312],
    (8610171849752115483, 8742605005068551458], (8068079250973315241,
    8185655671734937642], (560264964510741191, 790641981923757238],
    (5581202487214475094, 5765437476661317163], (8742605005068551458,
    8923908282731801645], (-6038176423022601107, -5622185785296846551],
    (5778671293583720541, 5960610072466058818], (-3899198931034439776,
    -3675103288021821677], (8356739976149429222, 8540448858050275827],
    (-6517757811218772762, -6038176423022601107], (-8052600134279395253,
    -7645879646119667643], (-327959672080599465, 37594980457713898],
    (7758080813117351135, 8019254284118543066], (4781565016737645510,
    5067070718000527886], (2523396860109747637, 3083778975065200884],
    (-5354212455975869358, -4767213984179237293], (6784138025918878582,
    7190719703944308372], (67885601051654285, 447405341661896387],
    (-2190610927722759275, -1305178847032904465], (-4747032371925842389,
    -4070378863644120252]}, size=48

After:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    after  ranges = {(-8052600134279395253, -7589962743703481776],
    (-7260764223708720005, -7131649010279865221], (-6703329224557608009,
    -3376937163195039606], (-2671876682129336880, -2320523117224780931],
    (-2190610927722759275, -1137497074548854552], (-576028861239229331,
    447405341661896387], (560264964510741191, 790641981923757238],
    (2394447940525988167, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (4781565016737645510, 5067070718000527886],
    (5581202487214475094, 5972289771228014343], (6784138025918878582,
    7190719703944308372], (7749618183851698485, 8019254284118543066],
    (8068079250973315241, 8185655671734937642], (8356739976149429222,
    8923908282731801645]}, size=15
2016-12-12 11:09:26 +08:00
Asias He
e523803a5d token_metadata: Introduce interval_to_range helper
It is used to convert a boost::icl::interval<token> interval back to a
range<token>.
2016-12-12 11:09:26 +08:00
Asias He
af3d76e6ac repair: Fix a typo in the log
sucessfully -> successfully
2016-12-12 11:09:26 +08:00
Asias He
374324e6fb repair: Fix shard_begin and shard_end
A range now alternates between different shards: the first part of the
range goes to shard X, the next to shard X+1, but after a while we go
back to shard X. So we can't do a simple loop between shard_begin and
shard_end.

Fix by using the newly introduced dht::split_range_to_shards

Use the cf.make_streaming_reader with ranges to simplify the code a bit.
2016-12-12 11:09:26 +08:00
Asias He
1987264beb streaming: Make streaming reader with ranges
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.

1) Fewer readers are needed
before: one reader per range
after: at most smp::count readers

2) No foreign_ptr is needed
There is no need to forward to a shard to make the foreign_ptr for
send_info in the first phase and forward to that shard to execute the
send_info in the second phase.

3) No do_with is needed in send_mutations since si now is a
lw_shared_ptr

4) Fix possible use after free of 'si' in do_send_mutations
We need to take a reference to 'si' when sending the mutation with the
send_stream_mutation rpc call, otherwise:
   msg1 got exception
   si->mutations_done.broken()
   si is freed
   msg2 got exception
   si is used again
The issue is introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts) which is master only, branch 1.5 is not
affected.
2016-12-12 09:04:21 +08:00
Asias He
463cc4fbde dht: Introduce split_ranges_to_shards
Split ranges into a shard ranges map with the ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
044c4ff44c dht: Introduce split_range_to_shards
Split a range into a shard ranges map with the ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
cd2105b8bd database: make_streaming_reader for ranges
Allow making a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in a following streaming patch.

We can make the reader more efficient later.
2016-12-12 09:04:21 +08:00
Duarte Nunes
ada2f1092e dht: Make i_partitioner::tri_compare pure virtual
This patch makes the i_partitioner::tri_compare() function pure
virtual as it is overridden by all partitioners.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211172037.16496-1-duarte@scylladb.com>
2016-12-11 19:29:37 +02:00
Duarte Nunes
bb66b051ed dht: Make i_partitioner::tri_compare memory safe
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, which caused us to access random
memory and get random results.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
2016-12-11 18:58:10 +02:00
Amnon Heiman
08dcd8cb4a scylla housekeeping ubuntu service: use uuid file
This patch adds uuid file support for ubuntu system. It also split the
behaviour between restart and daily checks. The first run in r mode and
the second in d mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:35:07 +02:00
Amnon Heiman
6fef24aaf0 housekeeping systemd service: use uuid file
This set the housekeeping systemd service to use a uuid file and use
daily mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:02:16 +02:00
Amnon Heiman
17b8306bc4 scylla-housekeeping support uuid file
Allows scylla-housekeeping getting the uuid from a file instead of the
command line.

If the file is missing no uuid will be used.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:00:34 +02:00
Avi Kivity
299d1fad0b Merge "reduce bloom filter overhead in compaction" from Raphael
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables that overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."

* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
  compaction: reduce bloom filter overhead with incremental selector
  tests: add test for sstable set's incremental selector
  sstable_set: introduce incremental selector
  compatible_ring_position: add function to return token
2016-12-11 09:46:58 +02:00
Glauber Costa
5803957ab5 compaction: fix build
Commit 732ee275 moved tracking of one statistics value inside a lambda
without capturing this in that lambda. Compilation fails as a result.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <68860640f4533dd43e43f341f1620e25464b700b.1481313455.git.glauber@scylladb.com>
2016-12-10 09:00:20 +02:00
Raphael S. Carvalho
fcfc84e836 compaction: reduce bloom filter overhead with incremental selector
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.

The function to calculate the maximum purgeable timestamp is made 10 times
faster when compacting sstables that overlap with 10% of all sstables.

Fixes #1322.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
548f6066c5 tests: add test for sstable set's incremental selector
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
02541e15c1 sstable_set: introduce incremental selector
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to the current interval. For other strategies, it just returns
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:16 -02:00
Glauber Costa
9b5e6d6bd8 commitlog: correctly report requests blocked
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.

That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.

This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
2016-12-09 15:02:26 +02:00
Raphael S. Carvalho
732ee275f8 compaction: fix running compaction counter when splitting sstables
The counter was being increased before taking the semaphore, so
every pending split would count as a running compaction, which
misleads the user.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <f2050cc3599cee7af29d4579368a154708b37731.1481248048.git.raphaelsc@scylladb.com>
2016-12-09 15:01:43 +02:00
Raphael S. Carvalho
453620a316 compatible_ring_position: add function to return token
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-08 14:25:29 -02:00
Avi Kivity
872b5ef5f0 sstables: fix probe with Unknown component
Commit 53b7b7def3 ("sstables: handle unrecognized sstable component")
ignores unrecognized components, but misses one code path during probe_file().

Ignore unrecognized components there too.

Fixes #1922.
Message-Id: <20161208131027.28939-1-avi@scylladb.com>
2016-12-08 15:24:25 +01:00
Glauber Costa
733d87fcc6 database: try to acquire semaphore before we start flush
As Tomek pointed out, since we are starting the flush before we acquire the
semaphore, we are not really limiting parallelism, but only delaying the
end of the flush instead.

Fixes #1919

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>
2016-12-08 12:18:32 +01:00
Tomasz Grabiec
3511bf4a81 Merge branch 'tgrabiec/memtable-gentle-clearing' from seastar-dev.git
When row cache is disabled, update_cache() will do nothing to the
memtable. Active readers may keep the memtable alive for an unbounded
amount of time, preventing it from going away. This doesn't play well
with virtual dirty accounting. Soon before calling update_cache(), the
memory which was subtracted during flush is added back to the amount
of virtual dirty memory. If there was write pressure all along, we
will be at the dirty memory limit. When we give back subtracted memory
this will put virtual dirty way above the limit. This will stall all
writes until another memtable flush drags virtual dirty down or
readers finally release the memtable. We want to prevent upward
jumps of virtual dirty.

First part of the fix is to ensure that as long as the memtable's
region is in the dirty group, we will not revert flushed memory. This
must happen synchronously from region's memory being removed from the
group in order to prevent upward virtual dirty jumps. To make this
easier, tracking of flushed memory was moved to the memtable object.

Another part of the fix is to gradually clear the memtable when cache
is disabled in a similar fashion as when it's moved to cache. This
ensures that the actual memory held by the memtable's region is released
before the memtable itself dies.

Refs #1879
2016-12-08 12:18:32 +01:00
Gleb Natapov
a05516f14c storage_proxy: wire up range_slice_timeouts, range_slice_unavailables and read_unavailables counters
Message-Id: <20161206105154.GL1866@scylladb.com>
2016-12-08 11:42:52 +02:00
Avi Kivity
5530a61975 stables: fix build with older boost (boost::variant::get<T&>)
Older boost doesn't support boost::variant::get<T&> (where the type
parameter is reference qualified); remove (unneeded anyway).
2016-12-08 10:56:05 +02:00
Pekka Enberg
0bc3ce7e09 Merge "sstables: remove sharding metadata from Statistics component" from Avi
"Due to my misreading of Cassandra code, I thought it would ignore new
components in the Statistics component; however, it doesn't, and the change
(introduced in bdd11648ac ("sstables: add
intra-node sharding metadata") breaks sstable2json and likely any
Cassandra code that touches sstables.

To fix, move the sharding data into a new component ("Scylla.db"), which
Cassandra does ignore.  The new component is designed to be extensible so
we don't experience the same issue later on."
2016-12-08 10:14:07 +02:00
Avi Kivity
7f26f9c0f9 Merge "repair refactor and fix" from Asias
* tag 'asias/repair/subranges/refactor_fix/v1' of github.com:cloudius-systems/seastar-dev:
  repair: Limit the number of sub ranges
  repair: Use estimated_keys_for_range in repair_cf_range
  repair: Extract the target_partitions into repair_info class
  repair: Put request_transfer_ranges into repair_info class
  repair: Introduce check_failed_ranges helper
  repair: Introduce do_streaming helper
  repair: Make the neighbors const reference
  repair: Introduce repair_info
  repair: Attach the repair id in the stream plan name
2016-12-08 10:06:39 +02:00
Tomasz Grabiec
f7197dabf8 commitlog: Fix replay to not delete dirty segments
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by the current node since boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.

The fix is to record preexisting segments before any new segments
have a chance to be created, and use that as the replay list.

Introduced in abe7358767.

dtest failure:

 commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup

Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
2016-12-07 15:54:47 +02:00
Avi Kivity
4fedbf8430 Merge "service::storage_proxy: rework collectd counters registration" from Vlad
- Add "coordinator" and "replica" categories
   - Use a new seastar/metrics_registration framework

* 'rearrange-storage-proxy-stats-v4' of github.com:cloudius-systems/seastar-dev:
  service::storage_proxy: rework the collectd counters registration
  service/storage_proxy: regroup collectd statistics
2016-12-07 15:38:40 +02:00
Avi Kivity
3c3a18f222 sstables: move sharding metadata from Statistics component to a new Scylla component
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
2016-12-07 15:20:13 +02:00
Avi Kivity
24140ec8c6 sstables: add support for sets of discriminated union types
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the discriminant as the key.  Parsers and writers
are auto-generated.
2016-12-07 13:27:52 +02:00
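The shape being described can be sketched in plain C++ (this is an illustration of the concept, not the generated Scylla serialization code): a union discriminated by an enum, and a set of such unions keyed by the discriminant.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <variant>

// Hypothetical discriminant and union: each enumerator selects one
// serializable value type.
enum class type_tag { integer, text };

using value = std::variant<int, std::string>;

// "Set" of discriminated unions: the discriminant is the key, so there
// is at most one value per discriminant.
using union_set = std::map<type_tag, value>;
```

In the sstable code the parsers and writers for such types are auto-generated; here the map/variant pair just shows the data model.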
Avi Kivity
e0cce9d299 Merge "streaming: Improve logging" from Asias
"This series adds streaming bandwidth and streaming plan name to the log when
streaming is finished."
2016-12-07 12:21:47 +02:00
Amos Kong
f32f7993cc systemd: reset housekeeping timer at each start
Currently the housekeeping timer isn't reset when we restart scylla-server.
We expect it to run at each start, consistent with the upstart script on
Ubuntu 14.04.

With this change, restarting scylla-server also restarts the housekeeping
timer; to achieve this, replace "OnBootSec" with "OnActiveSec".

Fixes: #1601

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <a22943cc11a3de23db266c52fd476c08014098c4.1480607401.git.amos@scylladb.com>
2016-12-06 18:33:37 +02:00
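The change amounts to swapping the trigger in the timer unit. A minimal sketch of such a unit (illustrative values, not the exact file Scylla ships):

```ini
[Timer]
# OnBootSec fires relative to boot time, so restarting scylla-server
# does not reset it. OnActiveSec fires relative to when the timer unit
# became active, matching the Ubuntu 14.04 upstart behaviour.
OnActiveSec=5min
OnUnitActiveSec=1d
```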
Takuya ASADA
5a5ab51254 dist/ubuntu/dep: fix incorrect file path in check for previously built .deb existence
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-4-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
6dd6b868a6 scripts/scylla_install_pkg: support Debian
Support Debian in the installation script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-3-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
7f2df8f86e dist/common/scripts: introduce scylla_lib.sh
To reduce duplicated code and simplify the scripts, introduce
scylla_lib.sh, a shell library which provides functions to classify
distributions and to load all sysconfig files.

This also fixes script bugs that misdetected Debian and RHEL.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-2-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
8464903021 dist/common/systemd/scylla-housekeeping.timer: workaround to avoid crash of systemd on RHEL 7.3
RHEL 7.3's systemd contains a known bug in timer.c:
https://github.com/systemd/systemd/issues/2632

This is a workaround to avoid hitting the bug.

Fixes #1846

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480452194-11683-1-git-send-email-syuu@scylladb.com>
2016-12-06 10:48:28 +02:00
Takuya ASADA
b2c0059da3 dist/common/scripts/scylla_coredump_setup: use systemd-coredump on Ubuntu 16.04
Ubuntu 16.04 has systemd-coredump, so it is better to use it.

Fixes #1916

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480679267-30844-1-git-send-email-syuu@scylladb.com>
2016-12-05 17:09:38 +02:00
Takuya ASADA
2976799ef2 main: fix startup failing on Ubuntu 15.10/16.04
Ubuntu 15.10/16.04 still uses Upstart to manage the GUI session (though not as init). When we launch Scylla directly from a GUI terminal on Ubuntu (not via systemctl or initctl), raise(SIGSTOP) is mistakenly called, because the GUI session sets the "UPSTART_JOB" environment variable. This does not happen when Scylla runs as a systemd service.

To avoid this, we need to verify that UPSTART_JOB == "scylla-server".
In a GUI session UPSTART_JOB is e.g. "unity7", and we must not raise(SIGSTOP) in that case.

Fixes #1199

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480620421-28967-1-git-send-email-syuu@scylladb.com>
2016-12-05 16:28:25 +02:00
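The check described above can be sketched as a small predicate (hypothetical helper name, not the actual Scylla function): only stop for Upstart when we were really launched as the scylla-server job, not as part of a GUI session.

```cpp
#include <cassert>
#include <string_view>

// Returns true only when UPSTART_JOB is set and names the scylla-server
// job; a GUI session (e.g. UPSTART_JOB=unity7) or an unset variable
// must not trigger raise(SIGSTOP).
bool should_raise_sigstop(const char* upstart_job) {
    return upstart_job != nullptr
        && std::string_view(upstart_job) == "scylla-server";
}
```

In practice the argument would come from std::getenv("UPSTART_JOB").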
Tomasz Grabiec
527ff6aa40 db: Clear memtable after flush when cache is disabled
So that memory is released gradually (impacting latency less) and
sooner than when the memtable is destroyed. Active readers may keep the
memtable alive for an unbounded amount of time.

Refs #1879
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1bba51319e memtable: Maintain virtual dirty on clear()
When a memtable is flushing, it subtracts _flushed_memory from the group's
size to gradually allow more writes. Ideally _flushed_memory would be
equal to region's size when flush ends, so the group's size would
reach zero. When the memtable and its region are gone the group size
should remain the same as after the flush. This is ensured by adding
back _flushed_memory to group's size right before the region is
removed from the group.

Calling clear() before region is removed from the group breaks the
accounting because it will shrink the region, but will not affect the
amount of memory subtracted due to _flushed_memory. So group's size
would decrease more than we want (twice the region's size). The fix is
to change clear() so that it reverts _flushed_memory by the amount by
which the region size is reduced. This will keep the group's size
constant as long as _flushed_memory > 0.
2016-12-05 12:59:09 +01:00
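The invariant above can be sketched numerically (hypothetical names; not the actual memtable code): the group's size is the region size minus _flushed_memory, and clear() must revert _flushed_memory by however much it shrinks the region so the group's size stays constant.

```cpp
#include <cassert>

// Accounting sketch: group_size = region_size - flushed_memory.
struct accounting {
    long region_size = 0;
    long flushed_memory = 0;

    long group_size() const { return region_size - flushed_memory; }

    // clear() shrinks the region by `freed`; the fix reverts
    // flushed_memory by the same amount, keeping group_size constant
    // for as long as flushed_memory > 0.
    void clear(long freed) {
        region_size -= freed;
        if (flushed_memory > freed) {
            flushed_memory -= freed;
        } else {
            flushed_memory = 0;
        }
    }
};
```

Without the revert, clear() would reduce group_size twice: once through the region shrinking and once through the already-subtracted _flushed_memory.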
Tomasz Grabiec
1b5f338c17 memtable: Track flushed memory in memtable object 2016-12-05 12:59:09 +01:00
Tomasz Grabiec
c3768fe4de memtable: Pass dirty_memory_manager& to memtable constructor
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:

  boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));

This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
2016-12-05 12:59:09 +01:00
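The safer shape can be sketched with stand-in types (hypothetical, heavily simplified): instead of recovering the owner of the region group via boost::intrusive::get_parent_from_member(), which is undefined behavior when the group is not actually a member of a dirty_memory_manager, the memtable takes an explicit reference.

```cpp
#include <cassert>

struct region_group {};

struct dirty_memory_manager {
    region_group _region_group;
};

struct memtable {
    dirty_memory_manager& _mgr; // explicit dependency, no pointer arithmetic
    explicit memtable(dirty_memory_manager& mgr) : _mgr(mgr) {}
    region_group& group() { return _mgr._region_group; }
};
```

The dependency is now visible in the constructor signature, and there is no assumption about where the group object lives.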
Asias He
00d7a35949 utils: Put crc32 under utils namespace
It conflicts with crc in zlib.
Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>
2016-12-05 11:48:29 +02:00
Takuya ASADA
54ea0055fc dist/common/scripts/node_exporter_install: use curl instead of wget
CentOS and Ubuntu include curl in a minimal installation but not wget,
and we already have a dependency on curl, so we should switch to curl.

Fixes #1902

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480929047-2347-1-git-send-email-syuu@scylladb.com>
2016-12-05 11:26:36 +02:00
Asias He
86c2620b7a gossip: Skip stopping if it is not started
If an exception is triggered early in boot while doing an I/O operation,
scylla will fail because the I/O checker calls the storage service to stop
the transport services, not all of which have been initialized yet.

Scylla was failing as follow:
scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local()
[with Service = gms::gossiper]: Assertion `local_is_initialized()' failed.
Aborting on shard 0.
Backtrace:
  0x000000000048a2ca
  0x000000000048a3d3
  0x00007fc279e739ff
  0x00007fc279ad6a27
  0x00007fc279ad8629
  0x00007fc279acf226
  0x00007fc279acf2d1
  0x0000000000c145f8
  0x000000000110d1bc
  0x000000000041bacd
  0x00000000005520f1
  0x00007fc279aeaf1f
Aborted (core dumped)

Refs #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>
2016-12-05 09:42:37 +02:00
Asias He
49229964d0 streaming: Add streaming plan name when session is failed
Before:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed, peers={127.0.0.1, 127.0.0.2}

After:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed for streaming plan repair-in-29, peers={127.0.0.1, 127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
1c47e26913 streaming: Add streaming plan name when all sessions are completed
Before:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed, peers={127.0.0.2}

After:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed for streaming plan repair-in-32, peers={127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
984f427cb5 streaming: Log streaming bandwidth
It looks like:

[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.1 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.2 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.3 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] bytes_sent = 393284364, bytes_received = 0, tx_bandwidth = 17.048 MiB/s, rx_bandwidth = 0.000 MiB/s
[Stream #f3907fd0-a557-11e6-a583-000000000000] All sessions completed, peers={127.0.0.1, 127.0.0.2, 127.0.0.3}

Fixes #1826
2016-12-05 08:20:18 +08:00
Asias He
4ae5781e40 repair: Limit the number of sub ranges
A range is divided into N sub ranges so that each sub range contains 100
partitions, so N depends on the number of partitions in that range. N
can grow unbounded, and with it the memory usage of the vector holding
these sub ranges.

Limit the maximum number of sub ranges a range can be divided into.

The downside is that capping the number of sub ranges makes us include
more partitions per checksum.

Fixes #1917
2016-12-05 08:12:48 +08:00
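The computation being described can be sketched as follows (the constants and the helper name are hypothetical; the actual cap lives in the repair code):

```cpp
#include <cassert>
#include <algorithm>

// One sub range per ~100 partitions, but capped so the vector holding
// the sub ranges cannot grow without bound.
long sub_ranges_for(long estimated_partitions,
                    long partitions_per_sub_range = 100,
                    long max_sub_ranges = 1024) {
    long n = std::max(1L, estimated_partitions / partitions_per_sub_range);
    // The cap trades memory for checksum granularity: past it, each sub
    // range simply covers more partitions.
    return std::min(n, max_sub_ranges);
}
```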
Asias He
d850b86145 repair: Use estimated_keys_for_range in repair_cf_range
Use the newly introduced interface to estimate the number of partitions in
the range.
2016-12-05 08:05:07 +08:00
Asias He
7b63cbbe0d repair: Extract the target_partitions into repair_info class
We can tune the number on a per repair basis.
2016-12-05 08:05:07 +08:00
Asias He
d9b689321e repair: Put request_transfer_ranges into repair_info class 2016-12-05 08:05:07 +08:00
Asias He
7741393059 repair: Introduce check_failed_ranges helper
To check whether there are any failed ranges and log them.
2016-12-05 08:05:07 +08:00
Asias He
f8d7aa597b repair: Introduce do_streaming helper
To execute the stream_plans to sync data between nodes.
2016-12-05 08:05:07 +08:00
Asias He
d0a6290d4f repair: Make the neighbors const reference
We do not modify it, so make it a const reference.
2016-12-05 08:05:07 +08:00
Asias He
6d0f6c1a99 repair: Introduce repair_info
To reduce the number of parameters we pass around; this simplifies the
code a little.
2016-12-05 08:05:06 +08:00
Asias He
9be5170c07 repair: Attach the repair id in the stream plan name
So that we know which repair id this stream plan belongs to.
2016-12-05 08:05:06 +08:00
Tomasz Grabiec
d496dfeced Update seastar submodule
* seastar 7790e68...0a74317 (2):
  > core/reactor: Move definitions out of #ifndef
  > Add systemtap-sdt-devel to fedora dependencies

Fixes #1915.
2016-12-02 10:49:17 +01:00
Vlad Zolotarov
e5e7ac1bd4 service::storage_proxy: rework the collectd counters registration
Use the new seastar's metrics_registration framework:
   - Change the registration syntax.
   - Add a long description for each counter.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:38:09 -05:00
Vlad Zolotarov
3bf12e4ffc service/storage_proxy: regroup collectd statistics
Instead of putting all statistics under the same "storage_proxy" category
separate them into 2 groups according to where the corresponding counters
are updated:
   - "storage_proxy_replica"
   - "storage_proxy_coordinator"

Fixes #1763

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:27:47 -05:00
Glauber Costa
99a5a77234 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, especially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
could be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will illustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>
2016-12-01 13:20:46 +01:00
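The single-pool idea can be sketched like this (hypothetical shape, much simpler than the real segment manager): every segment id is handed out from one queue that a single background filler feeds, so ids are consumed in allocation order and a replay position can never be observed out of order.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct segment_pool {
    uint64_t next_id = 0;
    std::deque<uint64_t> reserve;

    // Run by the background filler fiber.
    void fill_one() { reserve.push_back(next_id++); }

    // Allocation path: never create a segment directly; always draw
    // from the reserve, so there is a single source of ordering.
    uint64_t new_segment() {
        if (reserve.empty()) {
            fill_one();
        }
        uint64_t id = reserve.front();
        reserve.pop_front();
        return id;
    }
};
```

The bug arose precisely because two creation paths (direct allocation and reserve refill) could interleave across deferring points; with one pool there is nothing to interleave.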
Tomasz Grabiec
570fc0008b scylla-gdb: Fix lookup of symbols in 'scylla ptr'
Message-Id: <1480529617-26564-1-git-send-email-tgrabiec@scylladb.com>
2016-12-01 12:33:29 +02:00
Raphael S. Carvalho
b30a2cb21a lcs: generate info that preserves token distribution in higher levels
The information (last compacted keys) is lost after the node is restarted
or the schema is updated, which causes the strategy to be rebuilt.
We need it for strategy to guarantee uniform distribution of token
range across sstables, or we could end up with 1 sstable of level L
overlapping with lots of sstables of level L+1, and that results in
a compaction of undesired length.
That information can be generated from scratch by getting last key
of newest sstable in each level > 0.

Fixes #1906.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:58 +02:00
Raphael S. Carvalho
38743c1948 sstables: provide write time of data component
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <59686148149f2159990329775e0cd8780bc54254.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:57 +02:00
Glauber Costa
d7256e7b21 database: do not call seal directly from the streaming timer
Streaming memtable have a delayed mode where many flushes are coalesced
together into one, with the actual flush happening later and propagated
to all the previous waiters.

However, the timer that triggers the actual flush was not using the
newly introduced flush infrastructure. This was a minor problem because
those flushes wouldn't try to take the semaphore, and so we could have
many flushes going on at the same time.

What was a potential performance issue became a correctness issue when
we moved the reversal of the dirty memory accounting out of
revert_potentially_cleaned_up_memory() into remove_from_flush_manager().

Since the latter is only called through the flush infrastructure, it
simply wasn't called. So the deferral of the reversal exposed this bug.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>
2016-11-30 18:00:55 +01:00
Tomasz Grabiec
c35e18ba12 tests: Fix use-after-free on commitlog
Only shutdown() ensures all internal processes are complete. Call it before calling clear().

Message-Id: <1480495534-2253-1-git-send-email-tgrabiec@scylladb.com>
2016-11-30 11:03:26 +02:00
Avi Kivity
281b4c64ea Update ami submodule
* dist/ami/files/scylla-ami 25e101f...d5a4397 (1):
  > scylla_install_ami: allow specify different repository for Scylla installation and receive update
2016-11-29 19:26:49 +02:00
Takuya ASADA
17ef5e638e dist/ami: allow specifying different repositories for Scylla installation and for receiving updates
This fix splits build_ami.sh --repo into three different options:
 --repo-for-install is for Scylla package installation, only valid
 during AMI construction.

 --repo-for-update will be stored at /etc/yum.repos.d/scylla.repo, to
 receive update package on AMI.

 --repo is both, for installation and update.

Fixes #1872

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480438858-6007-1-git-send-email-syuu@scylladb.com>
2016-11-29 19:26:07 +02:00
Avi Kivity
5ea235e3e8 Merge "Prevent overloading memory with timed out writes" from Tomasz
"The goal of this series is to prevent unbounded memory use
in cases when requests are timing out. Write requests which timed
out may still occupy memory for a while because of local mutation
application. This memory is not accounted for and can build up.

First part of the fix changes local mutation application so that it times out
at about the same time as the request handler. Then the life
time of the request handler is extended to cover any background activity
of that request which hasn't timed out yet. This has two main effects:

  (1) by timing out local writes we prevent build up of background activity
      for timed out requests

  (2) we ensure that memory used by background activity is not left
      behind unaccounted for. This will prevent CQL server from admitting
      more requests than memory usage limit allows.

Fixes #1756."

* tag 'tgrabiec/prevent-oom-on-timeouts-v5' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: Do not flood logs with timeout errors
  database: Add counter for timed out writes
  storage_proxy: Delay timeout response until background work ceases
  storage_proxy: Propagate timeout to local writes
  storage_proxy: Use shared ownership for abstract_write_response_handler
  storage_proxy: Add counter for all alive write handlers
  db: Allow writes to be timed out
  db: Introduce counters for failed reads and writes
  commitlog: Allow allocations to be timed out
  utils/logalloc: Add ability to timeout run_when_memory_available() task
  utils/flush_queue: Add ability to wait with a timeout
2016-11-29 18:55:52 +02:00
Avi Kivity
28a5ff51cb dist: add build dependency on systemtap-sdt
Needed by newer seastar.
2016-11-29 18:49:51 +02:00
Tomasz Grabiec
48bbd6733c storage_proxy: Do not flood logs with timeout errors
Timeout errors are flooding the log after local mutate can time
out. We don't log remote mutate timeouts, so for consistency we won't
log local ones as well.

There is a database counter for timed out writes which can be
consulted in order to check if they're occurring.

Perhaps this would be better solved by a generic log message
throttling/coalescing mechanism, but that's not ready yet.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
b5d5612f98 database: Add counter for timed out writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
14cb31f69a storage_proxy: Delay timeout response until background work ceases
Write requests which timed out may still occupy memory for a while due
to local write. It should time out soon as well but there is a time
window in which it has not yet. If we don't delay timeout response,
the request would be seen as not consuming any memory too early. This
in turn would cause the CQL server to allow more requests than we
want. In some cases causing OOM or exceeding memory limits and causing
excessive cache eviction.

Fixes #1756.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
ba3779802f storage_proxy: Propagate timeout to local writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
6d195a1538 storage_proxy: Use shared ownership for abstract_write_response_handler 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
5805330d98 storage_proxy: Add counter for all alive write handlers
Currently the counter uses _response_handlers.size(), but after later
patches we may have an active (timed out) write with no response
handler, so count live instances instead.
2016-11-29 16:40:58 +01:00
Tomasz Grabiec
2c561ecaed db: Allow writes to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
b1ae6ad2ad db: Introduce counters for failed reads and writes 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
31645e2c4a commitlog: Allow allocations to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
e14caaef60 utils/logalloc: Add ability to timeout run_when_memory_available() task 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
61d81617e1 utils/flush_queue: Add ability to wait with a timeout 2016-11-29 16:40:58 +01:00
Raphael S. Carvalho
a16425833c size_tiered: do not recreate bucket when it goes beyond max threshold
Problem will cause size tiered to return small jobs when there are
more than max_threshold sstables of similar size. For example, if
max_threshold is 32, and there are 36 sstables of similar size,
strategy will only return 4 sstables to be compacted. That's because
we incorrectly create a new bucket when it meets the max threshold.
What we should do is to allow buckets to grow beyond max threshold
and trim them when selecting the most suitable one for compaction.

Important to mention that estimation for size tiered will now
work better when there are more than max_threshold sstables of
similar size.

Fixes #1901.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>
2016-11-29 16:56:02 +02:00
Glauber Costa
353a4cd2d4 commitlog: sync segments before acquiring semaphore on shutdown.
Sync all segments before acquiring the semaphore, otherwise we may
have to wait for the timer to kick in and push them down.
Note that we can't guarantee that no other requests were executed in the
mean time, so we have to sync again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>
2016-11-29 11:07:28 +02:00
Tomasz Grabiec
96c7764458 Revert "prevent commitlog replay position reordering during reserve refill"
This reverts commit 0e9b75d406.

commitlog_test fails with this:

Running 14 test cases...
ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed
ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required)

*** 1 failure detected in test suite "tests/commitlog_test.cc"
WARN  2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
2016-11-28 20:52:13 +01:00
Raphael S. Carvalho
f141b0cdae database: atomically add new sstables to cf when refreshing
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.

Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.

Fixes #1368.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
2016-11-28 20:30:48 +02:00
Glauber Costa
0e9b75d406 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, especially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
could be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will illustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>
2016-11-28 19:26:26 +01:00
Takuya ASADA
1042e40188 dist/common/scripts/scylla_kernel_check: fix incorrect document URL
Fixes #1871

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480327243-18177-1-git-send-email-syuu@scylladb.com>
2016-11-28 13:51:19 +02:00
Avi Kivity
18df2d9e9e partition_version: fix const correctness in rows_entry_compare
Using a non-const-correct comparator results in build failures with
boost 1.55.

Fixes #1892.
Message-Id: <20161128104335.28789-1-avi@scylladb.com>
2016-11-28 10:55:12 +00:00
Avi Kivity
5358984982 Merge seastar upstream
* seastar 93c3b12...7790e68 (7):
  > core/reactor: Introduce reactor-*/dervie-busy_ns metric
  > Collectd: Hold a reference to the metrics implementation in registration
  > future: Improve comments
  > fstream: actually use dynamically adjusted buffer
  > debug: add latency detector script
  > reactor: add static probes for latency detector
  > semaphore: Fix with_semaphore() in case wait() throws
2016-11-28 11:05:59 +02:00
Avi Kivity
28857e42e7 Merge " Virtualize size_estimates system table" from Duarte
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.

This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.

Fixes #1616"

* 'virtual-estimates/v4' of github.com:duarten/scylla:
  size_estimates_virtual_reader: Add unit test
  db: Delete size_estimates_recorder
  size_estimates: Add virtual reader
  column_family: Add support for virtual readers
  storage_service: get_local_tokens() returns a future
  nonwrapping_range: Add slice() function
  range: Find a sequence's lower and upper bounds
  system_keyspace: Build mutations for size estimates
  size_estimates: Store the token range as bytes
  range_estimates: Add schema
  murmur3_partitioner: Convert maximum_token to sstring
2016-11-28 10:12:59 +02:00
Avi Kivity
176fca5775 logalloc: use correct header for unique_ptr
<bits/unique_ptr.hh> is a libstdc++ internal header.  Use <memory> instead.
2016-11-27 23:08:04 +02:00
Glauber Costa
c32803f2f0 database: move reversion of virtual dirty state closer to update_cache.
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.

During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.

This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
2016-11-24 18:18:15 +01:00
Raphael S. Carvalho
4781b6eb71 sstables: use nonwrapping_range::make to avoid compilation issues
GCC 5.3.1 was unable to convert bound to optional<bound>.

sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));

In file included from ./dht/i_partitioner.hh:52:0,
                 from ./query-request.hh:28,
                 from ./clustering_key_filter.hh:27,
                 from sstables/sstables.hh:35,
                 from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
     explicit nonwrapping_range(const wrapping_range<T>& r)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
2016-11-24 11:26:16 +02:00
Duarte Nunes
cc3f26c993 lz4: Conditionally use LZ4_compress_default()
Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.

This is especially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the resulting deprecation warning.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
2016-11-24 11:25:03 +02:00
Avi Kivity
1be95b1227 Merge seastar upstream
* seastar d6f26d8...93c3b12 (3):
  > rpc: Conditionally use LZ4_compress_default()
  > queue: allow queue to change its maximum size
  > util/defer: add missing return to move assignment
2016-11-24 11:00:53 +02:00
Duarte Nunes
a527ba285f thrift: Don't apply cell limit across rows
In Thrift, SliceRange defines a count that limits the number of cells
to return from that row (in CQL3 terms, it limits the number of rows
in that partition). While this limit is honored in the engine, the
Thrift layer also applies the same limit, which, while redundant in
most cases, is used to support the get_paged_slice verb.

Currently, the limit is not being reset per Thrift row (CQL3
partition), so in practice, instead of limiting the cells in a row,
we're limiting the rows we return as well. This patch fixes that by
ensuring the limit applies only within a row/partition.
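The fix can be sketched in isolation (hypothetical helper, not the actual Thrift-layer code): the cell budget is reset for every row instead of one budget decaying across rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the fix: `limit` is a per-row cell budget,
// reset for each row, rather than shared across all rows.
std::vector<std::vector<int>>
limit_cells_per_row(std::vector<std::vector<int>> rows, std::size_t limit) {
    for (auto& row : rows) {
        if (row.size() > limit) {
            row.resize(limit);  // the limit applies within this row only
        }
    }
    return rows;
}
```

With a shared budget, the second row above would have been truncated to zero cells; resetting per row keeps both rows intact up to the limit.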

Fixes #1882

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161123220001.15496-1-duarte@scylladb.com>
2016-11-24 10:38:31 +02:00
Takuya ASADA
ce80fb3a39 dist/ubuntu: increase number of open files on Ubuntu 14.04(upstart)
Follow the change of NOFILE for non-systemd environments.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479975050-14907-1-git-send-email-syuu@scylladb.com>
2016-11-24 10:13:41 +02:00
Avi Kivity
d58c8aaa32 db: remove unused belongs_to_{current,other}_shard(s) functions
Obsoleted by the new sharding mechanism, but they break the build for some.
2016-11-23 21:39:29 +02:00
Avi Kivity
b81a57e8eb config, dht: reduce default msb ignore bits to 4
With the default value of 12, a node's range is partitioned into
4096 * smp::count sub-ranges which are queried sequentially for a range
scan.  If the number of rows in the table is smaller than the required
result size, we will query all of them.  This can take so long that we
time out.

A better fix is to query multiple sub-ranges in parallel and merge them,
but for that we need to resurrect the non-sequential merger.
2016-11-23 21:25:37 +02:00
Pekka Enberg
c526a9f0be Update seastar submodule
* seastar 7473945...d6f26d8 (2):
  > semaphore_units: add missing return statement
  > metrics: Do not detroy the metrics layer if it is been used
2016-11-23 20:27:09 +02:00
Paweł Dziepak
919825a2c7 Merge "Improve sharding in large clusters" from Avi
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively.  Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.

This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes.  Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.

In order to prevent re-sharding sstables on each boot, the token ranges
for the sstable are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.

Fixes #1277."
2016-11-23 11:25:53 +00:00
Glauber Costa
18b9fa3d43 dist: increase number of open files
This limit was found to be too low for production environments. It would
be hit at boot, when we're touching a lot of files from multiple shards
before deciding that we don't need them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>
2016-11-23 13:10:25 +02:00
Avi Kivity
07d5a20bae Wire up sharding ignore msb parameter to configuration
We might have used a fancy map<sstring, any> to pass the parameters, but
that's overkill for now.
2016-11-22 22:40:47 +02:00
Avi Kivity
8b1d689de8 partitioner: add ignore_msb parameters to byte ordered and random partitioners
Ignored; it doesn't make sense for the byte-ordered partitioner, and the
random partitioner is deprecated.
2016-11-22 21:56:42 +02:00
Avi Kivity
af16c0fac4 murmur3_partitioner: shard on the middle token bits, not most significant bits
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.

This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's shard, so we use token bits which still have
some entropy.  In effect, this changes the token range layout from

   shard 0
   shard 1
   ...
   shard S-1

to

   shard 0
   shard 1
   ...
   shard S-1

   shard 0
   shard 1
   ...
   shard S-1

   ...

   shard 0
   shard 1
   ...
   shard S-1

Where the number of repetitions of the block is 2^(ignored msb bits).

For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
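A minimal stand-alone sketch of the idea (illustrative, not Scylla's exact implementation): discard the ignored most-significant bits before mapping the token onto a shard, so tokens that differ only in the vnode-selecting bits land on the same shard.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: map a 64-bit token to a shard while ignoring the
// top `ignore_msb` bits, which are reserved for vnode selection.
inline unsigned shard_of(uint64_t token, unsigned shard_count, unsigned ignore_msb) {
    uint64_t t = token << ignore_msb;  // drop the vnode-selecting bits
    // Multiply-shift spreads the remaining bits uniformly over [0, shard_count).
    return static_cast<unsigned>(((unsigned __int128)t * shard_count) >> 64);
}
```

With ignore_msb = 12, two tokens that differ only in their top 12 bits map to the same shard, which is exactly the repeating block layout shown above.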
2016-11-22 21:56:42 +02:00
Avi Kivity
024c8ef8a1 db: adjust sstable load to use sstable self-reporting of shard ownership
Instead of calculating the owning shard from the sstable's partition
key range, delegate to the new sstable method for getting owning shard
information.  This insulates us from changes in the sharding algorithm.
2016-11-22 21:56:40 +02:00
Avi Kivity
98a4544e1c sstables: add method to get sstable owning shards from an unloaded sstable
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do.  Add a method that allows us
to read just the sharding data, without opening anything else.
2016-11-22 21:52:23 +02:00
Avi Kivity
bdd11648ac sstables: add intra-node sharding metadata
Add a metadata component that describes token ranges that are spanned by
this sstable.  With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describe the sharding information, but for multi-range algorithms, this
is not sufficient.
2016-11-22 21:44:25 +02:00
Avi Kivity
316ef1d70a sstables: automate writing statistics components
Add a virtual function to metadata_base so we can loop over statistics
components when writing them.
2016-11-22 21:05:06 +02:00
Glauber Costa
13973e7f3b keep background work semaphore alive during sstable flush
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we start the flush and not when we finish it.

That's not the intended behavior and that can cause flushes to
accumulate.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
2016-11-22 19:54:08 +01:00
Avi Kivity
d05b22e502 sstables: automatically calculate offsets in statistics
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
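The loop is essentially a prefix sum over component sizes; a simplified stand-alone version (hypothetical names) looks like:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch: compute each statistics component's on-disk offset
// by accumulating serialized sizes, instead of hand-computing each offset.
std::vector<std::size_t> component_offsets(const std::vector<std::size_t>& sizes) {
    std::vector<std::size_t> offsets;
    offsets.reserve(sizes.size());
    std::size_t offset = 0;
    for (auto size : sizes) {
        offsets.push_back(offset);  // offset where this component starts
        offset += size;             // the next component starts right after it
    }
    return offsets;
}
```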
2016-11-22 20:35:24 +02:00
Avi Kivity
7c5e6525ef sstables: switch statistics components to generic serialized_size() implementation 2016-11-22 20:20:38 +02:00
Avi Kivity
096ae59a5b sstables: introduce generic serialized_size()
Introduce a new function that reuses the file_writer code to compute
the serialized size of an sstable object, by serializing it into memory
and discarding the result.
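The technique can be sketched like this (illustrative names; the real code reuses the sstable file_writer): route the normal write() path through a sink that only tracks how many bytes would be emitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative sketch: a sink that discards bytes but counts them, so
// the regular serialization path doubles as a size calculator.
struct counting_writer {
    std::size_t size = 0;
    void write(const char*, std::size_t n) { size += n; }
};

// Stand-in for the generic write(file_writer&, const T&) overloads.
template <typename Writer>
void write(Writer& w, uint32_t v) {
    char buf[sizeof(v)];
    std::memcpy(buf, &v, sizeof(v));
    w.write(buf, sizeof(buf));
}

template <typename T>
std::size_t serialized_size(const T& v) {
    counting_writer w;
    write(w, v);  // serialize into the counter, discard the bytes
    return w.size;
}
```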
2016-11-22 20:06:23 +02:00
Avi Kivity
3c06ffac9d sstables: const correctness for the write(file_writer&, T&) functions
write() doesn't need to change its input; so change it to const.

The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.

This helps in writing a generic serialized_size() that is const correct,
in the next patch.
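The "cheat" can be illustrated with stand-in types (not the actual sstable code): describe_type() takes a non-const reference even though it never mutates its argument, so a const-correct write() casts constness away at the call site.

```cpp
#include <cassert>

// Stand-in illustration of the const_cast workaround described above.
struct type_describer {
    int visits = 0;
    template <typename T>
    void describe_type(T& v) { ++visits; (void)v; }  // non-const parameter, never mutates
};

template <typename T>
void write(type_describer& d, const T& v) {
    // Safe only because describe_type() does not actually modify v.
    d.describe_type(const_cast<T&>(v));
}

// Helper for exercising the sketch with a const value.
inline int describe_const_int(const int& v) {
    type_describer d;
    write(d, v);
    return d.visits;
}
```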
2016-11-22 20:04:27 +02:00
Tomasz Grabiec
eefc538225 Update seastar submodule
* seastar 7504026...7473945 (1):
  > Merge "Improve support for timeouts in primitives"
2016-11-22 17:51:29 +01:00
Glauber Costa
0b8b5abf16 commitlog: acquire semaphore earlier
Recently we have changed our shutdown strategy to wait for the
_request_controller semaphore to make sure no other allocations are
in-flight. That was done to fix an actual issue.

The problem is that this wasn't done early enough. We acquire the
semaphore after we have already marked ourselves as _shutdown and
released the timer.

That means that if there is an allocation in flight that needs to use a
new segment, it will never finish - and we'll therefore never acquire
the semaphore.

Fix it by acquiring it first. At this point the allocations will all be
done and gone, and then we can shutdown everything else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>
2016-11-21 22:19:32 +00:00
Avi Kivity
6bdb8ba31d storage_proxy: don't query concurrently needlessly during range queries
storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be).  However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.

Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.

Fixes #1863.

Message-Id: <20161120165741.2488-1-avi@scylladb.com>
2016-11-21 18:19:46 +02:00
Glauber Costa
0ca8c3f162 database: keep a pointer to the memtable list in a memtable
We currently pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.

Doing that also has a couple of advantages. Mainly, during flush we must
get from a memtable to a memtable_list. Currently we do that by going from
the memtable to a column family through the schema, and from there to
the memtable_list.

That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep a
pointer, and when needed get the memtable_list directly.

Not only does that get rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this traversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.

At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.

The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.

Fixes #1870

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
2016-11-21 18:18:27 +02:00
Duarte Nunes
def2bc72b0 size_estimates_virtual_reader: Add unit test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
6a37d87c76 db: Delete size_estimates_recorder
Now that access to the size_estimates system is virtualized, we no
longer need the recorder.

Fixes #1616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
225648780d size_estimates: Add virtual reader
This patch adds a virtual mutation_reader so that queries
to the size_estimates system table are handled by the engine
without needing to perform any IO.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
cd7e2fd602 column_family: Add support for virtual readers
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
c0d450c57d storage_service: get_local_tokens() returns a future
This patch changes the get_local_tokens() function in storage_service
to return a future instead of requiring running under a seastar::thread.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
9b384d375f nonwrapping_range: Add slice() function
This patch adds the slice() function to nonwrapping_range, which uses
its bounds to slice an input sequence.
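A simplified stand-alone analogue (hypothetical signature; the real slice() lives on nonwrapping_range and takes a comparator): use the range's optional bounds to pick out the sub-span of a sorted sequence that the range covers.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical sketch: slice a sorted sequence by optional inclusive
// bounds; an empty bound means "from the start" / "to the end".
template <typename T>
std::vector<T> slice(const std::vector<T>& seq,
                     std::optional<T> start, std::optional<T> end) {
    auto first = start ? std::lower_bound(seq.begin(), seq.end(), *start)
                       : seq.begin();
    auto last  = end   ? std::upper_bound(seq.begin(), seq.end(), *end)
                       : seq.end();
    return {first, last};
}
```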

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
bdba8d99c3 range: Find a sequence's lower and upper bounds
This patch extracts a pair of functions from mutation_partition to
calculate the lower and upper bounds of a sequence from a
nonwrapping_range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
636287fdf2 system_keyspace: Build mutations for size estimates
This patch adds a function to system_keyspace responsible for creating
a mutation to a partition of the size_estimates system table from a
set of range_estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
18ddec245e size_estimates: Store the token range as bytes
This patch changes the range_estimates struct so that the tokens are
represented as utf8-encoded bytes. This will make future patches
require fewer conversions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:14:21 +00:00
Duarte Nunes
e7a5162c1d range_estimates: Add schema
This will be used in future patches, when virtualizing the
size_estimates system table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Duarte Nunes
01815ecd24 murmur3_partitioner: Convert maximum_token to sstring
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
minimum_token for the maximum_token.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Takuya ASADA
eee63027e5 dist/ami/build_ami.sh: update base AMI to CentOS7-Base5
To drop unnecessary .ssh/authorized_keys, we need to update the base AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479496938-29724-1-git-send-email-syuu@scylladb.com>
2016-11-21 10:12:47 +02:00
Avi Kivity
783729c540 Merge "Clean up T::memory_usage() function" from Paweł
"This series is just a cleanup whose intention is to deal with all the
confusion related to the way T::memory_usage() functions work.

* T::memory_usage() which returned external memory usage are renamed
  to T::external_memory_usage()
* T::memory_usage() is introduced where needed to avoid repeating
  sizeof(T) + T::external_memory_usage()"

Paweł Dziepak (6):
  rename memory_usage() to external_memory_usage() where applicable
  streamed_mutation: add memory_usage() to mutation fragment types
  keys: add memory_usage()
  partition_snapshot_accounter: use range_tombstone::memory_usage()
  mutation_rebuilder: use memory_usage()
  frozen_mutation: use memory_usage()
2016-11-21 10:11:39 +02:00
Avi Kivity
498887ca0d Merge seastar upstream
* seastar 31c5fd7...7504026 (2):
  > circular_buffer: add move assignment operator
  > scollectd: Fix serialization of GAUGE-typed values
2016-11-20 20:16:56 +02:00
Gleb Natapov
9222a47fed sstable test: add test for generated summary data
Message-Id: <20161117155051.GV6765@scylladb.com>
2016-11-20 19:50:45 +02:00
Glauber Costa
21c1e2b48c commitlog: wait for pending allocations to finish before closing gate.
Allocations may enter the gate, so it would be wise for us to wait for them.

Fixes #1860

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com>
2016-11-20 19:45:33 +02:00
Avi Kivity
a39b92a40a build: fix tests-with-symbols generation
Bad indentation caused the libs variable for tests-with-symbols to be
overwritten, resulting in link failure.
2016-11-20 17:23:26 +02:00
Glauber Costa
504b5ac30f database: don't check for waiters in the condition variable predicate.
In the last iterations of this patchset, we have moved explicit flushes
to acquire the semaphore directly and the coalescing inside the
memtable_list. As a result, we are no longer keeping any kind of action
for them inside the condition variable. Checking for them no longer
serves a purpose.

This is a cleanup patch that removes those checks.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>
2016-11-18 21:34:48 +01:00
Glauber Costa
1933349654 database: fix direct flushes of non-durable column families.
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.

We will remove the permits ourselves if there is an exception, but not
under normal circumstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this test will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
2016-11-18 21:32:29 +01:00
Avi Kivity
6eecbc80dc CONTRIBUTING.md: add sections for help and issues
Don't scare away users reporting an issue with the CLA.
2016-11-18 22:21:10 +02:00
Glauber Costa
60b7d35f15 commitlog: close file after read, and not at stop
There are other code paths that may interrupt the read in the middle
and bypass stop. It's safer this way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>
2016-11-18 14:09:33 +00:00
Paweł Dziepak
249e0ab087 frozen_mutation: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
948c062e64 mutation_rebuilder: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
e04664e851 partition_snapshot_accounter: use range_tombstone::memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
711bd19f16 keys: add memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
6b8bf030c0 streamed_mutation: add memory_usage() to mutation fragment types
This patch introduces memory_usage() to static_row, clustering_row and
range_tombstone so that we can avoid repeating sizeof(T) +
x.external_memory_usage().

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
ef57b9a26f rename memory_usage() to external_memory_usage() where applicable
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Avi Kivity
fec4ef3390 Merge "Make sure commitlog replay is able to make progress" from Glauber
"Fixes #1856

Commitlog replay reads are being issued without a priority. That means
they will lose to compaction every time."

* 'issue-1856-v2' of github.com:glommer/scylla:
  commitlog: use read ahead for replay requests
  commitlog: use commitlog priority for replay
  commitlog: close replay file
2016-11-18 12:04:18 +02:00
Takuya ASADA
55e5123313 dist/redhat: Support RHEL7
We supported installing the CentOS7 .rpm on RHEL7, but we haven't supported
building on RHEL7, since there is a small difference between CentOS
and RHEL7 that causes a build error.

This patch fixes the error; now we can produce an .rpm for RHEL7 without
using CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479431134-8032-1-git-send-email-syuu@scylladb.com>
2016-11-18 11:56:05 +02:00
Glauber Costa
461778918b fix shutdown and exception conditions for flush logic
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
  the explicit flushes no longer wait on the condition variable. So we
  don't.
- We now wait on the stop() flushes (regardless of their return status)
  so we can make sure that the _flush_queue will indeed be drained
- we acquire the semaphore before shutting down the dirty_memory_manager
  to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
  handler

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
2016-11-17 21:16:44 +01:00
Glauber Costa
59a41cf7f1 commitlog: use read ahead for replay requests
Aside from putting the requests in the commitlog class, read ahead
will help us go through the file faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:54 -05:00
Glauber Costa
aa375cd33d commitlog: use commitlog priority for replay
Right now replay is being issued with the standard seastar priority.
The rationale for that at the time is that it is an early event that
doesn't really share the disk with anybody.

That is largely untrue now that we start compactions on boot.
Compactions may fight for bandwidth with the commitlog, and with such
low priority the commitlog is guaranteed to lose.

Fixes #1856

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:02 -05:00
Glauber Costa
4d3d774757 commitlog: close replay file
The replay file is opened, so it should be closed. We're not seeing any
problems arising from this, but they may happen. Enabling read ahead in
this stream makes them happen immediately. Fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 12:35:24 -05:00
Avi Kivity
eaf83ab59c Merge seastar upstream
* seastar 3001c08...31c5fd7 (2):
  > Safe use of collectd during shutdown
  > udp: abort reader and writer when udp channel close
2016-11-17 18:44:28 +02:00
Piotr Jastrzebski
9d33948487 mutation_rebuilder: fix fragment size calculation
It wasn't calculating the size of data correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c03dfff7bf1ca3199991e5864189f98bfa2942ea.1479397736.git.piotr@scylladb.com>
2016-11-17 16:23:42 +00:00
Raphael S. Carvalho
3dc9294023 db: do not leak deleted sstable when deletion triggers an exception
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). An sstable only has its files closed when its
object is destructed.

The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.

We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.

Fixes #1840.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
2016-11-17 17:46:36 +02:00
Gleb Natapov
c052a1bc4f sstable: use schema's min_index_interval config when generating missing summary
Message-Id: <20161116181937.GA25303@scylladb.com>
2016-11-17 15:24:03 +02:00
Avi Kivity
5d067eebf2 Merge "get rid of memtable size parameter and rework flush logic" from Glauber
"This patchset allows Scylla to determine the size of a memtable instead
of relying on the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.

Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
last submission, except this time I have run some experiments to gauge
behavior of that versus 1/2 of dirty memory, which was a preferred
theoretical value.

After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allow us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.

I have run mainly two workloads with this. The first one a local RF=1
workload with large partitions, sized 128kB and 100 threads. The results
are:

Before:

op rate                   : 632 [WRITE:632]
partition rate            : 632 [WRITE:632]
row rate                  : 632 [WRITE:632]
latency mean              : 157.8 [WRITE:157.8]
latency median            : 115.5 [WRITE:115.5]
latency 95th percentile   : 486.7 [WRITE:486.7]
latency 99th percentile   : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max               : 722.6 [WRITE:722.6]
Total partitions          : 189667 [WRITE:189667]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

After:

op rate                   : 951 [WRITE:951]
partition rate            : 951 [WRITE:951]
row rate                  : 951 [WRITE:951]
latency mean              : 104.8 [WRITE:104.8]
latency median            : 102.5 [WRITE:102.5]
latency 95th percentile   : 155.8 [WRITE:155.8]
latency 99th percentile   : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max               : 1081.4 [WRITE:1081.4]
Total partitions          : 285324 [WRITE:285324]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

The other workload was the workload described in #1817. And the result
is that we now have a load that is very stable around 100k ops/s and
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s and lots of timeouts, or the deep reduction of
1.5-rc1."

* 'issue-1817-v4' of github.com:glommer/scylla:
  database: rework memtable flush logic
  get rid of max_memtable_size
  pass a region to dirty_memory_manager accounting API
  memtable: add a method to expose the region_group
  logalloc: allow region group reclaimer to specify a soft limit
  database: remove outdated comment
  database: uphold virtual dirty for system tables.
2016-11-17 14:36:43 +02:00
Avi Kivity
18078bea9b storage_proxy: avoid calculating digest when only one replica is contacted
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all.  The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).

This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.

In the future we may add other values for more hardware-friendly digest
algorithms.
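The decision reduces to a simple rule, sketched below with illustrative names (the actual verb wiring lives in storage_proxy):

```cpp
#include <cassert>

enum class digest_algorithm { none, MD5 };

// Illustrative sketch: compute a digest only when more than one replica
// is contacted, since a single reply has nothing to be reconciled against.
inline digest_algorithm choose_digest(unsigned replicas_contacted) {
    return replicas_contacted <= 1 ? digest_algorithm::none
                                   : digest_algorithm::MD5;
}
```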
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
2016-11-17 13:04:30 +02:00
Asias He
dc50ce0ce5 streaming: Make the mutation readers when streaming starts
Currently we make the mutation readers for streaming at different
time points, i.e.,

do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) {
     make a mutation reader for this range
     read mutations from the reader and send
})

If there is a write workload in the background, we will stream extra
data, since the later a reader is made, the more data we need to send.

Fix it by making all the readers before starting to stream.
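A minimal sketch of the fix (hypothetical generic helper, not the streaming code itself): create every reader up front, so all of them observe the same point in time, then consume them one by one.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sketch: build one reader per range *before* streaming
// any of them, so later ranges don't pick up writes made while the
// earlier ranges were being streamed.
template <typename Range, typename Reader>
std::vector<Reader> make_all_readers(const std::vector<Range>& ranges,
                                     std::function<Reader(const Range&)> make_reader) {
    std::vector<Reader> readers;
    readers.reserve(ranges.size());
    for (const auto& r : ranges) {
        readers.push_back(make_reader(r));  // all created at the same time point
    }
    return readers;  // now stream from each reader in turn
}
```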

Fixes #1815
Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>
2016-11-17 12:41:53 +02:00
Gleb Natapov
ae0a2935b4 sstables: fix ad-hoc summary creation
If an sstable's Summary is not present, Scylla does not refuse to boot but
instead creates the summary information on the fly. There is a bug in this
code though. The Summary file is a map between keys and offsets into the Index
file, but the code creates a map between keys and Data file offsets
instead. Fix it by keeping the offset of an index entry in the index_entry
structure and using it during Summary file creation.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
2016-11-17 11:05:23 +02:00
Glauber Costa
f08162e181 database: rework memtable flush logic
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.

This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.

After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.

For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.

Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently is that under heavy load the commitlog will keep
sealing memtables, adding to the existing load.

Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges.  That is the perfect moment for us to act.
It indicates that all the data flushing part is done.

The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:58 -05:00
Glauber Costa
895e838ac0 get rid of max_memtable_size
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.

It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.

The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.

The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing out the memtable when we hit
the limit, and then throttle. When the threshold is set to 1/4 of dirty, we
don't throttle at all.

This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
2ed3f342c1 pass a region to dirty_memory_manager accounting API
We would like to know which region a particular flush is coming
from, and account accordingly. The reasoning behind that is that soon
we'll be driving the flushes internally from the dirty_memory_manager
without explicitly triggering them.

We need to start a flush before the current one finishes, otherwise
we'll have a period without significant disk activity when the current
SSTable is being sealed, the caches are being updated, etc.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
0b337dab14 memtable: add a method to expose the region_group
That is technically not needed, because a memtable inherits from group. So
whenever we have a memtable, we can use its group() method to obtain a
group for it, and then from there go to the region_group.

However, region() is a const method in the memtable, so we have to play
tricks with const_cast, or remove the constness from the region. An
alternative, which I prefer, is to expose a method for the
region_group directly from the memtable object that does the right thing
and bypasses all that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
f86c9e36f4 logalloc: allow region group reclaimer to specify a soft limit
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.

However, for some applications, we would be interested in knowing about
memory build-up earlier, so we can start doing something about it before
we reach that condition.

This patch introduces soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition has
abated.

There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard, threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
da738a6cd1 database: remove outdated comment
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
919de98aa5 database: uphold virtual dirty for system tables.
Currently the virtual dirty mechanism is not properly set for system
tables. We haven't divided the system table allowance by two, which
means it won't start throttling earlier as it was supposed to.

In practice, this has little effect because system table requests are
very well behaved, their sizes well known, and they tend to be
force-flushed. But we should be consistent.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Avi Kivity
f26c6569d2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 61ff5c6...25e101f (1):
  > scylla_install_ami: delete unneeded authorized_keys from AMI image
2016-11-16 22:36:31 +02:00
Takuya ASADA
3802f289f8 dist: remove bc from dependency
Since we replaced the shell-script-based cpuset generator with a Python-based one,
we no longer depend on the bc command:
d571123afd

So drop it from the .rpm/.deb dependencies.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479152876-11020-1-git-send-email-syuu@scylladb.com>
2016-11-16 15:02:55 +02:00
Amnon Heiman
a4be7afbb0 API: cache_capacity should use uint for summing
Using integer as the type for the map_reduce causes numeric overflow.

Fixes #1801

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1479299425-782-1-git-send-email-amnon@scylladb.com>
2016-11-16 13:55:46 +01:00
Avi Kivity
31d0e31de2 Merge seastar upstream
* seastar 47e1821...3001c08 (5):
  > core: Introduce weak_ptr<>
  > timer: Add missing include
  > tutorial: fix TeX template
  > Merge "Adding the metrics layer" from Amnon
  > core/memory: let malloc(0) return a valid pointer
2016-11-16 14:20:49 +02:00
Pekka Enberg
8a4bd6ecd5 README: Guidelines for contributing
Message-Id: <1479288359-14168-1-git-send-email-penberg@scylladb.com>
2016-11-16 12:50:02 +02:00
Paweł Dziepak
f877be50b0 Merge "Keep wide partition cache entry longer than others" from Piotr
"Cache entries for wide partitions are usually smaller than other
entries and the cost of recreating them is higher so it makes sense
to keep them longer than ordinary entries."
2016-11-15 20:44:52 +00:00
Paweł Dziepak
b8d737ff0a tests/row_cache_test: verify that eviction follows lru
Refs #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:57:54 +01:00
Paweł Dziepak
999dafbe57 row_cache: touch entries read during range queries
Fixes #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:54:11 +01:00
Tomasz Grabiec
11c5f4ab50 storage_proxy: Add counters for throttled writes 2016-11-15 17:18:25 +01:00
Piotr Jastrzebski
5ec668c9c6 Add separate LRU for wide partitions.
Evict wide partitions only every 1000 normal partition
evictions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 16:19:13 +01:00
Piotr Jastrzebski
9a41bfbf69 Add collectd metric for wide partition evictions.
This will allow us to see how large the number of evictions
of cached information about wide partitions is.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 15:53:14 +01:00
Paweł Dziepak
055d78ee4c query_pagers: distinct queries do not have clustering keys
Query pager needs to handle results that contain partitions with
possibly multiple clustering rows quite differently than results with
just one row per partition (for example a page may end in a middle of
partition). However, the logic dealing with partitions with clustering
rows doesn't work correctly for SELECT DISTINCT queries, which are
much more similar to the ones for schemas without clustering key.

The solution is to set _has_clustering_keys to false in case of SELECT
DISTINCT queries, regardless of the schema, which will make the pager
correctly expect each partition to return at most one row.

Fixes #1822.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 11:06:01 +01:00
Glauber Costa
93386bcec7 histograms: do not use latency_in_nano
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nano just so we may need
to convert it back if the histogram needs another unit.

This patch will keep everything as a duration until last moment, and
then we'll convert when needed.

This was suggested by Amnon.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
2016-11-14 18:01:43 +02:00
Nadav Har'El
c5254b6502 repair: fix undefined variable
If the "trace" parameter of the repair was not given, we will use the
"trace" variable without setting it. We need to set a default value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1479136239-14204-1-git-send-email-nyh@scylladb.com>
2016-11-14 17:16:19 +02:00
Raphael S. Carvalho
e86de40b49 compaction_manager: inform about compaction cancelled by shutdown
After some changes in the compaction manager, the user is no longer
informed that compaction was cancelled in the event of shutdown. That's
because we only ignore the ready future when the compaction manager was
asked to stop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <02ca29b5a93fe3a558896598f325b0dce069e82c.1478277317.git.raphaelsc@scylladb.com>
2016-11-14 16:37:33 +02:00
Piotr Jastrzebski
4fe989d58e Cleanup sstables::mutation_reader::impl
Pointer to sstable seems unnecessary.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a45e8853af2b5f896ec44144fbc26d3325a5ec0c.1479123740.git.piotr@scylladb.com>
2016-11-14 11:52:52 +00:00
Avi Kivity
14c1b17105 storage_service: fix construct_range_to_endpoint_map with semi-infinite range
After the conversion to nonwrapping ranges, construct_range_to_endpoint_map()
may be called with semi-infinite token ranges, but it does not expect this,
calling nonwrapping_range::end()->value() unconditionally.

Fix by checking whether this is a semi-infinite range on the right, and
replacing ->value() with maximum_token() instead.

Fixes `nodetool describering` (once more).
Message-Id: <1478983010-29630-1-git-send-email-avi@scylladb.com>
2016-11-14 11:39:48 +01:00
Raphael S. Carvalho
9a9f0d3a0f main: fix exception handling when initializing data or commitlog dirs
Exception handling was broken because, after the io checker, the
storage_io_error exception is wrapped around system error exceptions.
Also, the message printed when handling the exception wasn't precise
enough for all cases, for example lack of permission to write to an
existing data directory.

Fixes #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>
2016-11-14 12:34:10 +02:00
Takuya ASADA
d571123afd dist/common/scripts/scylla_sysconfig_setup: stop using 'bc' command to generate cpuset parameter, use python script instead
We get an error from the bc command when we run the script on >34 ncpus;
to prevent the issue, add a Python script to generate the cpuset parameter.

Fixes #1824

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478887624-12737-1-git-send-email-syuu@scylladb.com>
2016-11-14 11:45:23 +02:00
Duarte Nunes
66f6a367a4 ring_position_range_sharder: Avoid copying eagerly
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161104115632.15974-1-duarte@scylladb.com>
2016-11-13 11:42:23 +02:00
Avi Kivity
bf20aa722b Merge "Fixes for histogram and moving average calculations" from Glauber
"JMX metrics were found to be either not showing, or showing absurd
values.  Turns out there were multiple things wrong with them. The
patches were sent separately but conflict with one another. This series
is a collection of the patches needed to fix the issues we saw.

Fixes #1832, #1836, #1837"
2016-11-13 11:16:32 +02:00
Avi Kivity
2670e46f3e storage_service: deinline most methods
Most inline methods in storage_service are too large to be inlined, and
just increase compile time.  De-inline them.
2016-11-12 21:12:28 +02:00
Glauber Costa
608d825790 histogram: fix reporting units
We are tracking latencies in nanoseconds, but almost everywhere else
they are reported in microseconds. Instead of just converting, this
patch tries to be a bit more future-proof and embeds the unit into the
type - and we then default to microseconds.

I have verified that the JMX measures now report sane values for both
the storage proxy and the column family. nodetool cfhistograms still
works fine. That one is reported in nanoseconds, but through the
estimated_histogram, not ihistogram.

Fixes #1836

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-11 11:36:56 -05:00
Glauber Costa
1342d044eb moving averages: change metrics calculation
We have recently fixed a bug due to which the constructor parameters for
moving average were inverted, leading to the numbers being just plain
wrong. However, the calculation of alpha was already inverted, meaning
it was right by accident and now that's wrong.

With the wrong alpha, the values we see are still correct, but they move
very quickly. The intention of this code is obviously to smooth things
out.

This was found out by Nadav. I have tested and confirmed that the smoothing
factor now works as expected.

Fixes #1837

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-10 22:33:34 -05:00
Amnon Heiman
a977ea85e1 histogram: moving_average and total rate should be calculated in seconds
The moving average and the total average should be calculated in seconds
and not nanoseconds.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-11-10 22:32:53 -05:00
Glauber Costa
d3f11fbabf histogram: moving averages: fix inverted parameters
moving_averages constructor is defined like this:

    moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)

But when it is time to initialize them, we do this:

	... {tick_interval(), std::chrono::minutes(1)} ...

As it can be seen, the interval and tick interval are inverted. This
leads to the metrics being assigned bogus values.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
2016-11-10 11:28:51 -08:00
Paweł Dziepak
f16d6f9c40 partition_version: make sure that snapshot is destroyed under LSA
Snapshot destructor may free some objects managed by the LSA. That's why
partition_snapshot_reader destructor explicitly destroys the snapshot it
uses. However, it was possible that an exception thrown by _read_section
prevented that from happening, making the snapshot destroyed implicitly
without the current allocator set to the LSA.

Refs #1831.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>
2016-11-10 13:13:10 +01:00
Gleb Natapov
27e041606b fix LOCAL_ONE printout
Message-Id: <20161109125307.GH7766@scylladb.com>
2016-11-09 12:53:55 +00:00
Duarte Nunes
e680587b8a sstable_test: Be explicit about uncompressed tables
After 7c28ed, the schemas defined in the test became compressed by
default. This patch changes the test so that it is explicit about
which schemas shouldn't define a compressor.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>
2016-11-09 11:21:59 +02:00
Pekka Enberg
b3dea313dd Merge "API changes for Cassandra 3.x migration" from Calle
"Mostly small changes/additions to the API calls to match Cv3
 requirements/semantics, i.e. updated scylla-jmx can implement required
 nodetool etc calls in a working fashion."
2016-11-09 10:30:32 +02:00
Duarte Nunes
e33c02aa60 cql3: Disable compression on empty properties
The CQL 3.1 documentation specifies that for disabling compression,
users should use an empty string:

ALTER TABLE mytable WITH COMPRESSION = {'sstable_compression': ''};

However, Cassandra also accepts the absence of the sstable_compression
option to disable compression. The patch 7c28ed prevented this behavior
in Scylla, which this patch aims to fix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478639499-4183-1-git-send-email-duarte@scylladb.com>
2016-11-09 10:03:59 +02:00
Gleb Natapov
93f068bd44 storage_proxy: fix speculation target selection logic
Current speculation target selection logic has several bugs in a multi-dc
setup. It may select a non-local target for CL=LOCAL, and it may select
more than one target to speculate, one of which is non-local.

Examples:

1. Two datacenters: DC1 RF 2, DC2 RF 2, and a read with LOCAL_QUORUM.

In this scenario db::filter_for_query() will return both replicas from
the local DC, and the speculation target selection logic will pick one
which will be in a different DC.

2. Two datacenters: DC1 RF 2, DC2 RF 2, and a read with LOCAL_ONE + RRD.DC_LOCAL

In this scenario db::filter_for_query() will return all nodes in the local DC,
and there are already enough nodes to speculate, but the current logic will
add one node from a non-local DC as a speculation target.

The patch below fixes both of those scenarios.

Message-Id: <20161103154637.GS7766@scylladb.com>
2016-11-08 18:32:47 +01:00
Paweł Dziepak
a8308e2a8d row_cache: dummy entry does not count as partition
Since the introduction of the continuity flag, the row cache contains a single
dummy entry. cache_tracker knows nothing about it, so it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
2016-11-08 13:54:44 +01:00
Piotr Jastrzebski
50b41f7d1d Fix row_cache_test
partition_range passed to row_cache::make_reader
has to be kept alive as long as the resulting reader
is used.

Otherwise weird things start to happen.

This used to work just out of pure luck.
When I started changing the row_cache implementation
I ran into very weird behaviors in these tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>
2016-11-08 13:28:53 +01:00
Calle Wilund
473326d49a api/column_family: Make mean row size return integral
As (at least) per C3, these metrics are integral in origin. Adapt.
(The other option would be to translate in jmx.)
(Other option would be to translate in jmx).
2016-11-08 12:22:04 +00:00
Calle Wilund
bd646a6755 repair (api): Add option handling (sort of) for nodetool default options 2016-11-08 12:22:04 +00:00
Calle Wilund
0181fc8159 api::cache_service: Add (dummy) calls for key&counter metrics 2016-11-08 12:22:04 +00:00
Calle Wilund
5eb54f9bc4 api::storage_service: c3 compat - make query keyspaces a trinary choice
all, user or non-local strategy ones.
2016-11-08 12:22:04 +00:00
Calle Wilund
3b7a7dd383 api::failure_detector: c3 compat - add endpoint phi value query 2016-11-08 12:22:04 +00:00
Calle Wilund
218df55349 failure_detector: add accessor and api shortcut for arrival samples 2016-11-08 12:22:04 +00:00
Calle Wilund
f9836cd23b api::endpoint_snitch: c3 compat - allow dc/rack query for broadcast 2016-11-08 12:22:04 +00:00
Calle Wilund
54ba06a8bf api::column_family: Add calls/parameters for c3 compatibility 2016-11-08 12:22:04 +00:00
Amnon Heiman
c8082ccadb API: fix a typo in storage_proxy
This patch fixes a typo in the URL definition, causing the metric in the
jmx not to find it.

Fixes #1821

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1478563869-20504-1-git-send-email-amnon@scylladb.com>
2016-11-08 11:09:21 +02:00
Amos Kong
95fe88c1d3 scripts/scylla_current_repo: use HTTP to access downloads.scylladb.com
HTTPS isn't available for downloads.scylladb.com; alternatively, we can
access it via https://s3.amazonaws.com/downloads.scylladb.com/...

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <d4b65e1724bbeb76c928790d5d3e95b91ee9db79.1478153034.git.amos@scylladb.com>
2016-11-08 11:03:50 +02:00
Avi Kivity
767cfb4fe9 storage_service: fix range wrapping in describe_ring even more
Commit 8fca1887c2 ("storage_service: fix range wrapping in
describe_ring") fixed incorrect range wrapping code for describe_ring,
but fails when the number of endpoints for a token is greater than one,
because the endpoints are stored in an unordered vector.

Fix by comparing the endpoints in a way that ignores their order.
Message-Id: <1478460826-15923-1-git-send-email-avi@scylladb.com>
2016-11-07 16:18:20 +01:00
Calle Wilund
11baf37ab5 commitlog: Prevent exceptions in stream::produce from being set twice
Fixes #1775
stream lacks a check "is_open", which is a bummer. We have to both
prevent exception propagation and add a flag of our own to make sure
exceptions in producer code reach the consumer, and do not simply
get lost in the reactor.
Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>
2016-11-07 11:41:33 +01:00
Tomasz Grabiec
e6cc0a2e10 Merge branch '1766/v1' from duarten/scylla.git
This patchset adds missing properties to the create_view_statement,
such as whether the view is compact or the order of its clustering
columns.

Fixes #1766
2016-11-07 10:44:24 +01:00
Takuya ASADA
0f1ba1a3bb dist/redhat: remove unused dependencies
Seems like we mistakenly added unneeded packages for BuildRequires when
we created the .spec file, so remove them.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478504761-15067-1-git-send-email-syuu@scylladb.com>
2016-11-07 09:48:50 +02:00
Paweł Dziepak
985d2f6d4a Merge "Remove quadratic behavior from atomic sstable deletion" from Avi
"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion.  This
causes high cpu utilization during startup.

Change the code to avoid quadratic complexity, and add some unit tests.

See #1812."
2016-11-04 15:48:04 +00:00
Avi Kivity
f75aceabc5 sstables: add unit tests for atomic deletion
We simulate shards deleting sstables, but this is all happening on a single
core, and no sstables are harmed during test execution.
2016-11-04 15:48:43 +02:00
Avi Kivity
f10b9906d8 sstables: move atomic deletion code to its own files
This will simplify unit testing.  We move generic code that
depends only on seastar, so compile time should not increase too much.
2016-11-04 15:47:35 +02:00
Avi Kivity
9e85653c33 sstables: make atomic_deletion_manager more abstract
Make the shard count and method of deleting sstables abstract, in order
not to require all that machinery for unit tests.
2016-11-04 15:44:09 +02:00
Avi Kivity
e527da1e3c sstables: wrap atomic deletion code in a class
This makes it easier to abstract and unit-test.
2016-11-04 15:44:07 +02:00
Avi Kivity
a05837936a sstables: remove quadratic behavior from atomic sstable deletions
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy.  This guarantees
that any exception does not change the data, since the assignment does
not require allocation.

However, it does result in quadratic behavior.  During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
are do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.

Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.

Fixes #1812 (the quadratic behavior part).
2016-11-04 15:42:44 +02:00
Avi Kivity
8fca1887c2 storage_service: fix range wrapping in describe_ring
describe_ring() tries to re-wrap the ranges, but fails because the ranges
are not sorted.  Adjust the code not to rely on sorting.
Message-Id: <1478198630-27483-1-git-send-email-avi@scylladb.com>
2016-11-04 10:48:14 +00:00
Paweł Dziepak
8afd9e52c7 Merge "Process range queries sequentially on shards" from Avi
"Currently, partition range queries are processed in parallel on all
shards. This is inefficient because we are likely to drop the results
from all but one shard, assuming a well-populated column family.  We
are multiplying our work by a factor of smp::count.

While this is worthwhile in its own right, it is really an excuse to
sneak in the range/shard generator (patch 5), which is preliminary for
a new sharding algorithm, dividing tokens among shards based on the
middle-significant bits rather than the most-significant bits (which
alias with vnodes)

Fixes #1573."
2016-11-04 09:58:04 +00:00
Tomasz Grabiec
c1a7e2090e Revert "database: change find_column_families signature so it returns a lw_shared_ptr"
This reverts commit f3528ede65.
2016-11-04 10:48:21 +01:00
Tomasz Grabiec
3b5ccda70e Revert "database: refactor code so apply_in_memory() is called only once"
This reverts commit 3f825f593d.
2016-11-04 10:48:18 +01:00
Tomasz Grabiec
6366eb5cf8 Revert "correctly calculate latencies for writes"
This reverts commit a382f10fc4.
2016-11-04 10:48:02 +01:00
Tomasz Grabiec
a5ee87611a Revert "database: when querying, move latency counter instead of copying"
This reverts commit 8840a5a593.
2016-11-04 10:47:58 +01:00
Tomasz Grabiec
f3c1ff78e6 Merge branch 'cql_read_write_counters-v4' from seastar-dev.git
New CQL counters from Vlad.
2016-11-04 09:19:07 +01:00
Avi Kivity
b3299d5bc3 storage_proxy: simplify range queries
Instead of asking a shard for cmd->partition_limit and cmd->row_limit,
just ask it for the number of partitions and rows still needed to
satisfy the query.  This removes the need to trim the shard's result.
2016-11-03 19:10:20 +02:00
Avi Kivity
a668e575f6 storage_proxy: execute multi-partition query sequentially over shards
Since every shard might cause the row_limit quota to be satisfied, every
shard might be the last one we need.  Hence it is better to process shards
sequentially, stopping if the quota is reached or the range is exhausted.

The original code tried to yield to reduce latency, but this is now
unnecessary, as we're doing a lot less work per iteration (if it becomes
necessary, we should do it on the replica shard, not the coordinating shard).
2016-11-03 19:10:20 +02:00
Avi Kivity
1d77e3a03a partitioner: add unit tests for token_for_next_shard()
i_partitioner::token_for_next_shard() is an inverse for
i_partitioner::shard_of(), test that this is so.
2016-11-03 19:10:20 +02:00
Avi Kivity
7202b94183 dht: introduce a sharder for vectors of partition ranges
Building on the single-range sharder, add a sharder for vectors of
partition ranges.  This helps with wrapped ranges, which are translated
into a vector containing two shards.
2016-11-03 19:10:20 +02:00
Avi Kivity
43a2380899 dht: add a generator for shard/range pairs
Divides a ring_position range into a sequence of shard/range pairs.  This
allows sequential iteration over shards in ring order.

The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page.  With the generator, we can switch to sequential
execution.
2016-11-03 19:10:17 +02:00
Avi Kivity
1f88d103a8 partitioner: add i_partitioner::token_for_next_shard()
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.

To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.

It is a sort-of inverse to shard_of(), in that

  shard_of(token_for_next_shard(t)) == shard_of(t) + 1
2016-11-03 19:09:23 +02:00
Vlad Zolotarov
6c15dd967a cql3::query_processor: make the collectd metrics registration nicer
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
36cc351ae1 cql3::query_processor: add a counter for BATCH CQL statements
- Add a "batches" member to cql_stats.
- Update it where appropriate.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
6e1d27bed1 cql3::query_processor: add a counter for a number of CQL modification requests ("writes")
- Add inserts, updates, and deletes members to cql_stats.
- Store cql_stats& in a modification_statement and increment the corresponding counter according to the value of the "type" field.
- Store cql_stats& in a batch_statement and increment the statistics for each BATCH member.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:15 -04:00
Vlad Zolotarov
fa4e1db0cb cql: add a counter for CQL read (SELECT) requests
- Add a "reads" counter to the cql3::cql_stats struct.
- Store a reference to query_processor::_cql_stats in the select_statement object.
- Increment the "reads" counter where needed.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Vlad Zolotarov
7606588267 cql3::query_processor: add cql_stats
- Add cql_stats member.
- Pass it to cql3::raw::parsed_statement::prepare() virtual method.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Glauber Costa
8840a5a593 database: when querying, move latency counter instead of copying
It is comprised of two time points. Let's move it instead of copying it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c7c155c77780e188bfbe05881c81ce86456016d5.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
a382f10fc4 correctly calculate latencies for writes
Right now we are calculating latencies only when we are about to add an
item to the memtable.

That's incorrect and misleading, for two reasons. First, it leaves the
commitlog latencies out. But second, it is done after the memtable wall
effect is applied, which means we are not counting throttle time neither
in the memtables or in the commitlog.

To do that, we'll start the latency_counter object as soon as possible
and move it all the way to apply_in_memory(). That should span the
entire write operation.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
3f825f593d database: refactor code so apply_in_memory() is called only once
There are two variants of apply_in_memory() being called in do_apply():
with and without the commitlog. The main differences are that when the
commitlog is involved, we need to wait for its future to complete before
moving to apply_in_memory. That can easily be factored out by providing
an always-ready future if we don't have the commitlog enabled, and
waiting on that.

The second is that the commitlog version can cause apply_in_memory to
generate an exception if there is replay position reordering. However,
there is no harm in appending the exception handler to both versions. In
one of them it's an impossible exception, but that's fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
f3528ede65 database: change find_column_families signature so it returns a lw_shared_ptr
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.

If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.
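The keep-alive idea can be illustrated with std::shared_ptr standing in for seastar's lw_shared_ptr (all names below are illustrative stubs, not the actual database API):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stub standing in for the real column_family class.
struct column_family {
    std::string name;
};

using cf_ptr = std::shared_ptr<column_family>;

std::map<std::string, cf_ptr> g_column_families;

cf_ptr find_column_family(const std::string& name) {
    auto it = g_column_families.find(name);
    return it == g_column_families.end() ? nullptr : it->second;
}

// Holding the shared pointer across a deferring point keeps the object
// alive even if the table is dropped meanwhile; callers that must know
// whether the table still exists can re-check the map, but callers that
// only need a valid object avoid the cost of re-finding it.
std::string use_across_deferring_point(const std::string& name) {
    auto cf = find_column_family(name);
    if (!cf) return "missing";
    g_column_families.erase(name);   // simulate a drop during a yield
    return cf->name;                 // still safe: object kept alive by cf
}
```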

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Avi Kivity
6c45b0bae8 partitioner: make comparators public
The public comparison operators depend on global_partitioner(), and are
therefore less useful for tests.
2016-11-03 11:27:40 +02:00
Avi Kivity
6320181b97 partitioner: const correctness for comparators 2016-11-03 11:27:40 +02:00
Avi Kivity
470826d127 partitioner: change partitioners to have shard counts independent from smp::count
Useful for testing.
2016-11-03 11:27:40 +02:00
Avi Kivity
75706c0a26 size_estimates_recorder: sort token range before rewrapping it
Since size estimates are stored as wrapped ranges, we call compat::wrap()
to convert from the now-standard unwrapped ranges back to wrapped ranges.
However, compat::wrap() relies on the ranges being in sorted order,
but our input is not.  This leads to a crash as we find an unexpected
empty token in the middle of the vector.

Sort it so compat::wrap() works as expected.
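The precondition being fixed can be illustrated generically (this is not the actual compat::wrap() implementation; merge_adjacent and the integer tokens are stand-ins): merging or re-wrapping adjacent ranges only works on sorted input, so sort first.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using token_range = std::pair<int, int>;   // [start, end], simplified tokens

// Coalesce touching/overlapping ranges; sorting first is the precondition
// that the commit above restores for compat::wrap().
std::vector<token_range> merge_adjacent(std::vector<token_range> ranges) {
    std::sort(ranges.begin(), ranges.end());   // required for the merge below
    std::vector<token_range> out;
    for (auto& r : ranges) {
        if (!out.empty() && out.back().second >= r.first) {
            out.back().second = std::max(out.back().second, r.second);
        } else {
            out.push_back(r);
        }
    }
    return out;
}
```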

Fixes #1804.
Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>
2016-11-03 09:43:41 +01:00
Avi Kivity
a35136533d Convert ring_position and token ranges to be nonwrapping
Wrapping ranges are a pain, so we are moving wrap handling to the edges.

Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.

Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
2016-11-02 21:04:11 +02:00
Takuya ASADA
8c55c99353 dist/common/scripts/scylla_io_setup: pass --smp option to iotune command
We ignored the --smp option taken from io.conf since iotune didn't support
it, but now that it is supported we can pass it.
(We need to pass it because we need to measure io performance under the same
conditions as scylla)

Fixes #1768

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478082591-27205-1-git-send-email-syuu@scylladb.com>
2016-11-02 12:49:50 +02:00
Raphael S. Carvalho
53b7b7def3 sstables: handle unrecognized sstable component
As in C*, unrecognized sstable components should be ignored when
loading a sstable. At the moment, Scylla fails to do so and will
not boot as a result. In addition, unknown components should be
remembered when moving a sstable or changing its generation.

Fixes #1780.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b7af0c28e5b574fd577a7a1d28fb006ac197aa0a.1478025930.git.raphaelsc@scylladb.com>
2016-11-02 12:44:53 +02:00
Avi Kivity
72c2982260 dist: require scylla-boost-static for EL RPM build 2016-11-01 18:55:55 +02:00
Pekka Enberg
e1e8ca2788 cql3: Fix selecting same column multiple times
Under the hood, the selectable::add_and_get_index() function
deliberately filters out duplicate columns. This causes
simple_selector::get_output_row() to return a row with all duplicate
columns filtered out, which triggers an assertion because of a row
mismatch with the metadata (which contains the duplicate columns).

The fix is rather simple: just make selection::from_selectors() use
selection_with_processing if the number of selectors and column
definitions doesn't match -- like Apache Cassandra does.

Fixes #1367
Message-Id: <1477989740-6485-1-git-send-email-penberg@scylladb.com>
2016-11-01 09:09:01 +00:00
Pekka Enberg
d46ed53e9e scripts: add update-version
This patch adds an `update-version` script for updating the Scylla
version number in `SCYLLA-VERSION-GEN` file and committing the change to
git.

Example use:

  $ ./scripts/update-version 1.4.0

which results in the following git commit:

  commit 4599c16d9292d8d9299b40a3e44ef7ee80e3c3cf
  Author: Pekka Enberg <penberg@scylladb.com>
  Date:   Fri Oct 28 10:24:52 2016 +0300

      release: prepare for 1.4.0

  diff --git a/SCYLLA-VERSION-GEN b/SCYLLA-VERSION-GEN
  index 753c982..eba2da4 100755
  --- a/SCYLLA-VERSION-GEN
  +++ b/SCYLLA-VERSION-GEN
  @@ -1,6 +1,6 @@
   #!/bin/sh

  -VERSION=666.development
  +VERSION=1.4.0

   if test -f version
   then

Message-Id: <1477639560-10896-1-git-send-email-penberg@scylladb.com>
2016-10-30 12:43:41 +02:00
Avi Kivity
feb8faf70b Merge "make refresh resilient to permission denied error" from Raphael
Fixes #1709.

* 'refresh-resilient-v3' of github.com:raphaelsc/scylla:
  db: make refresh resilient to permission denied error
  db: make it possible to use custom error handler with io checker
  sstables: remove duplicated declaration of remove_by_toc_name
2016-10-30 10:28:09 +02:00
Takuya ASADA
68d9f5212c dist/ubuntu/dep/thrift.diff: add missing build time dependency
We need libcrypto header to build thrift, so add it.

Fixes #1798

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477676716-5726-1-git-send-email-syuu@scylladb.com>
2016-10-29 17:49:30 +03:00
Avi Kivity
71532d8cd5 Merge seastar upstream
* seastar 05f6c5c...47e1821 (1):
  > rpc: Avoid using zero-copy interface of output_stream (Fixes #1786)
2016-10-28 14:09:16 +03:00
Avi Kivity
e03ca06431 dist: fix rpm build
--static-boost is supposed to be an input to ./configure.py, not ninja.  Move
it there.
2016-10-28 08:42:26 +03:00
Pekka Enberg
b54870764f auth: Fix resource level handling
We use the `data_resource` class in the CQL parser, which lets users refer
to a table resource without specifying a keyspace. This asserts out in
get_level() for no good reason as we already know the intended level
based on the constructor. Therefore, change `data_resource` to track the
level like upstream Cassandra does and use that.

Fixes #1790

Message-Id: <1477599169-2945-1-git-send-email-penberg@scylladb.com>
2016-10-27 23:37:26 +03:00
Glauber Costa
ef3c7ab38e auth: always convert string to upper case before comparing
We store all auth perm strings in upper case, but the user might very
well pass them in lower case.

We could use a standard key comparator / hash here, but since the
strings tend to be small, the new sstring will likely be allocated on
the stack here and this approach yields significantly less code.
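The approach can be sketched as follows (to_upper and has_permission are illustrative names, not the real Scylla identifiers): normalize the input to upper case before the comparison rather than installing a case-insensitive comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Uppercase a copy of the input; for short strings the copy is cheap
// (likely a small-string/stack allocation in sstring-like types).
std::string to_upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::toupper(c); });
    return s;
}

// Stored permission strings are upper case, so normalizing the user's
// input makes the comparison case-insensitive with very little code.
bool has_permission(const std::string& user_input) {
    static const std::string stored = "SELECT";   // stored upper-case
    return to_upper(user_input) == stored;
}
```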

Fixes #1791.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <51df92451e6e0a6325a005c19c95eaa55270da61.1477594199.git.glauber@scylladb.com>
2016-10-27 22:08:57 +03:00
Raphael S. Carvalho
d11e839520 db: make refresh resilient to permission denied error
Users may forget to set permissions on new sstables in the upload dir
before refreshing them, and that will result in a shutdown.
io_checker is now able to work with a custom handler, so all we
have to do is to whitelist EACCES.

Fixes #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 16:50:40 -02:00
Raphael S. Carvalho
a3e065da9b db: make it possible to use custom error handler with io checker
By default, io checker will cause Scylla to shutdown if it finds
specific system errors. Right now, io checker isn't flexible
enough to allow a specialized handler. For example, we don't want
Scylla to shut down if there's a permission problem when
uploading new files from the upload dir. This desired flexibility is
made possible here by allowing a handler parameter to io check
functions and also changing existing code to take advantage of it.
That's a step towards fixing #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 15:54:21 -02:00
Takuya ASADA
a1b7e76d43 dist/ubuntu: support 16.10
Add 16.10 to 'supported_release'

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477585454-2115-1-git-send-email-syuu@scylladb.com>
2016-10-27 19:26:14 +03:00
Takuya ASADA
36e831a106 dist/common/scripts/scylla_bootparam_setup: support EC2 paravirtual instances
EC2 paravirtual instances use pv-grub, which refers to /boot/grub/menu.lst (the grub 0.9x config file) instead of the grub2 config file.
So add boot parameters to /boot/grub/menu.lst when the file exists and the instance is on EC2.

Fixes #1598

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472056875-17512-1-git-send-email-syuu@scylladb.com>
2016-10-27 18:55:05 +03:00
Avi Kivity
402a3f1c9f Merge seastar upstream
* seastar 9bed76a...05f6c5c (5):
  > reactor: improve task quota timer resolution
  > Update dpdk submodule to local-patches-20161027 tag
  > tests: wire up json_formatter_test
  > json_formatter_test: Add rudimentary json formatter test
  > scripts/posix_net_conf.sh: detect IRQs of virtio-net and xen_netfront correctly
2016-10-27 18:19:40 +03:00
Avi Kivity
e995f5a3a7 dist: statically link with boost on RHEL
Reduces runtime dependencies on Scylla-provided third-party boost packages.

Message-Id: <1477552490-28961-1-git-send-email-avi@scylladb.com>
2016-10-27 12:35:12 +03:00
Avi Kivity
76628a7b0b dist: make wget quieter
wget is often used from scripts whose output is recorded to logs; as it
emits a log line every second, the logs become huge and unreadable.  Make it quieter.

Message-Id: <1477558534-32718-1-git-send-email-avi@scylladb.com>
2016-10-27 12:11:26 +03:00
Avi Kivity
72d78ffa7e Merge "Cache fixes" from Paweł
"5ff699e09fcbd62611e78b9de601f6c8636ab2f0 ("row_cache: rework cache to
use fast forwarding reader") brought some significant changes to the
row cache implementation. Unfortunately, "significant changes" often
translates to "more bugs" and this time was no different.

This series contains fixes for the problems introduced in that rework
and makes failing dtest
bootstrap_test.py:TestBootstrap.local_quorum_bootstrap_test
pass again."

* 'pdziepak/cache-fixes/v1' of github.com:cloudius-systems/seastar-dev:
  row_cache: avoid dereferencing invalid iterator
  row_cache: set _first_element flag correctly
  row_cache: fix clearing continuity flag at eviction
2016-10-27 11:44:15 +03:00
Takuya ASADA
5cb7dc5dc3 dist/ubuntu/dep: update thrift to 0.9.3
To make thrift compile with gcc-6.2, we need to upgrade to the latest
version of thrift.
This is required to support Ubuntu 16.10.

Fixes #1784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477517671-18067-1-git-send-email-syuu@scylladb.com>
2016-10-27 10:22:06 +03:00
Paweł Dziepak
a7224ae46e row_cache: avoid dereferencing invalid iterator
Conditions in row_cache::do_find_or_create_entry() make it possible that
std::prev(it) is going to be dereferenced even if it is a begin
iterator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:24:23 +01:00
Paweł Dziepak
654f651e0c row_cache: set _first_element flag correctly
If the continuity flag was set for the first element, the _first_element
flag would not be cleared. This shouldn't cause any correctness problems,
but properly setting the flag allows us to avoid some unnecessary key
comparisons.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:07:24 +01:00
Paweł Dziepak
567ff96f2a row_cache: fix clearing continuity flag at eviction
In the original implementation the continuity flag indicated that the cache
has full information about the range between the current partition and
the one following it, hence when evicting an entry the one preceding it
had to have its continuity flag cleared.

This was changed, however, and now the continuity flag tells whether the
cache is continuous between the current element and the one before it.
This means that eviction code needs to clear the flag for the entry
directly following the evicted one.
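A toy model of the new semantics (not the real row_cache structures; a flat vector stands in for the cache's tree):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// entry.continuous means "the cache is complete between the previous
// entry and this one", so evicting entry i must clear the flag on the
// entry that directly follows it.
struct cache_entry {
    int key;
    bool continuous;   // continuous with the *preceding* entry
};

void evict(std::vector<cache_entry>& entries, std::size_t i) {
    entries.erase(entries.begin() + i);
    if (i < entries.size()) {
        // The element now at index i used to follow the evicted one; the
        // range before it is no longer known to be fully cached.
        entries[i].continuous = false;
    }
}
```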

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 14:58:20 +01:00
Raphael S. Carvalho
bc2d351c25 sstables: remove duplicated declaration of remove_by_toc_name
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-26 11:21:27 -02:00
Takuya ASADA
7617adadf4 dist/ami/files/.bash_profile: fix confusing message when running AMI on unsupported instance type
To describe which instance types are supported, show the documentation URL
instead of a confusing message.

Fixes #1646

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477473336-25373-1-git-send-email-syuu@scylladb.com>
2016-10-26 12:48:51 +03:00
Avi Kivity
7faf2eed2f build: support for linking statically with boost
Remove assumptions in the build system about dynamically linked boost unit
tests.  Includes seastar update which would have otherwise broken the
build.
2016-10-26 08:51:21 +03:00
Piotr Jastrzebski
27726cecff Clean up position_in_partition.
Introduce position_in_partition_view and use it in
position() method in mutation_fragment, range_tombstone,
static_row and clustering_row.
Clean up comparators in position_in_partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c65293c71a6aa23cf930ed317fb63df1fdc34fd1.1477399763.git.piotr@scylladb.com>
2016-10-25 15:13:20 +01:00
Tomasz Grabiec
cbaae2bf7f Merge seastar upstream
* seastar e18205b...3777135 (1):
  > rpc: Do not close client connection on error response for a timed out request

Fixes #1778
2016-10-25 13:59:41 +02:00
Raphael S. Carvalho
975ce62dbc sstables: do not swallow exception when reading TOC
That caused problems when refreshing a sstable with bad permissions.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <48e5322c53234209e55da05c64c99b8ec4e190a3.1477372974.git.raphaelsc@scylladb.com>
2016-10-25 12:21:32 +03:00
Avi Kivity
ddd4dbf928 Update scylla-ami submodule
* dist/ami/files/scylla-ami e1e3919...61ff5c6 (1):
  > scylla_ami_setup: run posix_net_conf.sh when NCPUS < 8
2016-10-25 11:18:58 +03:00
Avi Kivity
4b55a687b6 Merge seastar upstream
* seastar 98b5a2d...e18205b (1):
  > json::formatter: Add formatters for maps + rudimentary test
2016-10-25 11:17:29 +03:00
Avi Kivity
e8edaaf6a4 Merge seastar upstream
* seastar 69acec1...98b5a2d (9):
  > rpc: Silence warning about ignored failed future
  > future: prioritise continuations that can run immediately
  > iotune: relax aio restrictions
  > build: support for static linking with boost
  > rpc: Fix crash during connection teardown
  > rpc: Move _connected flag to protocol::connection
  > rpc test: fail test if exception is thrown during test execution
  > rpc: do not assume underling semaphore type
  > rpc: fix default resource limit
2016-10-25 11:09:40 +03:00
Avi Kivity
fc8210a875 tests: fix tests with boost 1.60
In boost 1.60, the executable's command-line arguments are expected to
be separated from the boost command-line arguments by '--'.  Detect
this requirement and comply with it.
Message-Id: <1477212424-3831-1-git-send-email-avi@scylladb.com>
2016-10-24 09:36:56 +02:00
Avi Kivity
37f112b610 dist: add python3-yaml to ubuntu dependencies for blocktune 2016-10-23 16:42:13 +03:00
Avi Kivity
7d50d6df9b blocktune: fix syntax error in exception handling 2016-10-23 16:40:00 +03:00
Avi Kivity
e261a380a9 dist: add PyYAML dependency to rpm (for blocktune) 2016-10-23 10:36:29 +03:00
Raphael S. Carvalho
fa308c079c database: fix collectd metrics for clustering key filter
Same instance name was used for exported metrics, which is
definitely wrong. Checked it works properly now via collectd
exporter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>
2016-10-22 09:51:18 +03:00
Glauber Costa
a13c410749 commitlog: cycle based on total size, not on mutation size
We calculate two sizes during the allocation: "size", which is the
in-segment size of this mutation, and "s", which is that plus the
overhead. cycle() must be called with the latter, not the former, as
doing otherwise may lead to buffer overflows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>
2016-10-21 18:57:41 +03:00
Glauber Costa
d9875784a1 commitlog: do not wait on pending operations for batch mode
This was explicitly mentioned in my set as gone in one of the versions.
Somehow it came back in the final version - sorry about that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2a0eba28cd74267d1a1fdcf1aef2901cc74ffc9f.1477059963.git.glauber@scylladb.com>
2016-10-21 17:27:16 +03:00
Vlad Zolotarov
f75a350a8f service::storage_proxy: use global_trace_state_ptr when using invoke_on
When trace_state may migrate to a different shard a global_trace_state_ptr
has to be used.

This patch completes the patch below:

commit 7e180c7bd3
Author: Vlad Zolotarov <vladz@cloudius-systems.com>
Date:   Tue Sep 20 19:09:27 2016 +0300

    tracing: introduce the tracing::global_trace_state_ptr class

Fixes #1770

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1476993537-27388-1-git-send-email-vladz@cloudius-systems.com>
2016-10-21 11:34:13 +03:00
Avi Kivity
e3ae54f0fe Merge "Rework commitlog to avoid timeouts" from Glauber
"This patchset reworks the commitlog logic to better handle conditions in
which we are getting requests faster than the disk can handle. It does
this by building a wall around the commitlog and only allowing
allocations to proceed when we are under the desired memory threshold.

The main advantage of that is that we can now easily set the commitlog
to work at disk speed, more or less allowing a "one byte in for each
byte out" approach instead of depending on the current cycle to finish.
As a result, max latencies are greatly reduced.

Testing Results
===============

To test this, I ran a workload that times out frequently. That
workload uses 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

I have run this with larger sizes as well, and it generally performs
much better than the baseline version. For sizes up to 5MB, I have seen
no timeouts in my setup. After that, I see some timeouts. Buffer
splitting is expected to make this better.

Aside from performance testing, this was also tested with batch and
periodic mode for various requests sizes."
2016-10-20 16:44:39 +03:00
Glauber Costa
d5618c6ace commitlog: add total_operations type for requests_blocked_memory
Current tracker for pending allocations is a queue_size GAUGE.  Add a
total_operations version so we have more insight on what's going on.

It will be called requests_blocked_memory for consistency with other
subsystems that track similar things.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-20 09:25:38 -04:00
Avi Kivity
db2f5e6be1 blocktune: wire up blocktune on startup
Message-Id: <1476357027-15014-3-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
098d02ad1a scylla-blocktune: introduce
scylla-blocktune is a script that parses scylla.yaml and tunes the data file
and commitlog directories it references.

Tuning includes:
 - set the I/O scheduler to noop
 - disable merging
 - tune dependent devices (like RAID members)

Message-Id: <1476357027-15014-2-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
fad34eef6c scylla_raid_setup: don't mess with read-ahead
It doesn't affect O_DIRECT reads, and it's not persistent.

Message-Id: <1476269082-2473-2-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Avi Kivity
a837da06ef scylla_raid_setup: increase chunk size
The current chunk size of 256 gives a 50% probability of a 128k read or
write getting split into two accesses.  This reduces efficiency and
increases latency.

Change the chunk size to 1MB, with a 12% probability of cross-member
access.
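The percentages above follow from simple arithmetic; a minimal sketch of the model (my approximation, not code from the patch):

```cpp
#include <cassert>

// A request starting at a uniformly random offset within the array
// straddles a chunk boundary with probability roughly
// request_size / chunk_size.
double split_probability(double request_kib, double chunk_kib) {
    return request_kib / chunk_kib;
}
```

With a 128 KiB access, a 256 KiB chunk gives the quoted 50%, and a 1 MiB chunk gives 12.5%, matching the ~12% figure.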

Message-Id: <1476269082-2473-1-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Takuya ASADA
80e3d8286c dist/ami: fix incorrect /etc/fstab entry on CentOS7 base image
There was an incorrect rootfs entry in /etc/fstab:
 /dev/sda1 / xfs defaults,noatime 1 1
This causes a boot error when updating to a new kernel.
(see:
https://github.com/scylladb/scylla/issues/1597#issuecomment-250243187)

So the entry was replaced with
 UUID=<uuid>  / xfs defaults,noatime 1 1
Also, all recent security updates were applied.

Fixes #1597
Fixes #1707

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475094957-9464-1-git-send-email-syuu@scylladb.com>
2016-10-20 11:48:24 +03:00
Takuya ASADA
5f602752a5 dist/ubuntu: backport g++-5 from Debian 9(stretch) to Debian 8(jessie)
Since Debian 8 (jessie) does not provide g++-5, we frequently got compile
errors because we were using an older compiler.
To fix the problem, backport g++-5 from Debian 9 (stretch).

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-3-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Takuya ASADA
7d67504b56 dist/ubuntu: use VERSION_ID from /etc/os-release instead of 'lsb_release -r'
On Debian, lsb_release -r returns a version number like '8.6'.
However, on this script we want to check major version only.
Therefore we can use VERSION_ID from /etc/os-release which only contains
major version number.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-2-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Avi Kivity
0da2f64cfb Merge seastar upstream
* seastar ccd8649...69acec1 (2):
  > app/iotune: add --smp option
  > rpc: Add missing adjustment of snd_buf::size

Fixes #1767.
Fixes #1768.
2016-10-20 11:16:40 +03:00
Paweł Dziepak
210a390892 tests: add missing sstable for partition skipping test
Commit 7dcd70124a "tests/sstables: add
test for fast forwarding reader" added a test for skipping parts of
sstable. Unfortunately, it did not include the sstables it was trying to
read.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 23:23:49 +01:00
Glauber Costa
1578d7363a commitlog: rework blocking logic
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.

That is obviously something we must do, but the current approach to it
is problematic for two main reasons:

1) It forces the requests that trigger a write to wait on the current
   write to finish. That is excessive; ideally we would wait for one
   particular write to finish, not necessarily the current one. That
   is made worse by the fact that when a write is followed by a flush
   (happens when we move to a new segment), then we must wait for
   *all* writes in that segment to finish.

2) It casts concurrency in terms of writes instead of memory, which
   makes the aforementioned problem a lot worse: if we have very big
   buffers in flight and we must wait for them to finish, that can
   take a long time, often in the order of seconds, causing timeouts.

The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account for the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.

This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.
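The admission scheme can be modeled, very loosely, as a byte-counting semaphore (a single-threaded toy with illustrative names; the real request_controller is built on seastar primitives and futures):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Admission is measured in bytes of buffer memory rather than number of
// in-flight writes, so releasing *any* buffer can admit waiters — the
// wait is spread across requests instead of piling up behind one cycle.
class request_controller {
    std::size_t _free;
    std::deque<std::pair<std::size_t, std::function<void()>>> _waiters;
public:
    explicit request_controller(std::size_t limit) : _free(limit) {}

    // Run `task` once `bytes` of budget are available (FIFO order).
    void admit(std::size_t bytes, std::function<void()> task) {
        if (_free >= bytes && _waiters.empty()) {
            _free -= bytes;
            task();
        } else {
            _waiters.emplace_back(bytes, std::move(task));
        }
    }

    // Return budget; wake waiters as soon as enough memory is back.
    void release(std::size_t bytes) {
        _free += bytes;
        while (!_waiters.empty() && _waiters.front().first <= _free) {
            auto [b, task] = std::move(_waiters.front());
            _waiters.pop_front();
            _free -= b;
            task();
        }
    }

    std::size_t free_bytes() const { return _free; }
};
```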

To test this, I ran a workload that times out frequently. That
workload uses 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:56:36 -04:00
Glauber Costa
aec724bbda commitlog: factor out code for checking mutation size
In a subsequent patch, I'll use this code in a different place. To
prepare for that, we move it out as a method. It also fits a lot better
inside the segment manager, so move it there.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
a50996f376 commitlog: calculate segment-independent size of mutations
The goal is to calculate a size that is less than or equal to the
segment-dependent size.

This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"

Extracted here so it sits clearly in a different patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
0b7c9fa17f commitlog: remove _needed_size
It is mostly an optimization, and while it makes sense in this context,
it soon won't be, as we'll stop waiting specifically for the current
cycle to finish.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
6214bdeb66 commitlog: move segment_manager constructor outside the class definition
We'll do that so we can, in following patches, use static members from
the segment. Those are not defined at this point.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
299877f432 commitlog: add a counter for pending allocations
We track the amount of pending allocations but we don't really export
it. It will be crucial when we stop tracking pending writes.

This patch exports it through a method instead of the totals structure,
so we can easily change it. Current code probing pending_allocations
(the api code) is also converted to use the public method instead of the
totals struct.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Avi Kivity
07c995ab3d Merge "Fast forward mutation readers" from Paweł
"This patchset enables mutation readers to be fast forwarded to a different
partition range. The main reason for introducing such feature are range
queries served from cache. If the cache is partially populated in the
requested range the reader will end up with multiple subranges that have
to be read from the sstables. Originally, each of these subranges would
require a new reader to be created, but with fast forwarding we can have
just one sstable reader. This is better since there is a chance that buffers
kept by the reader may be still useful after fast forwarding it.

In this series there are also patches that clean up cache readers in order
to make integration with fast forwarding easier. Namely, continuity flag is
changed to store information about range before the entry which significantly
simplifies the logic.

Fixes #1299."

* 'pdziepak/fast-forward-mutation-readers/v5' of github.com:cloudius-systems/seastar-dev: (24 commits)
  sstables: keep separate stream history for single and range reads
  sstables: drop sstable::{lower, upper}_bound()
  row_cache: rework cache to use fast forwarding reader
  row_cache: put cache entry flags in a struct
  row_cache: add do_find_or_create_entry() to reduce code duplication
  mutation_reader: forward fast_forward_to() calls
  tests/row_cache: add fast_forward_to() to throttled reader
  tests/row_cache: count mutations read from _underlying
  memtable: add support for fast_forward_to()
  drop key readers
  tests/mutation_reader: test fast forwarding combined reader
  database: enable fast forwarding of range_sstable_reader
  combined_mutation_reader: implement fast_forward_to()
  mutation_reader: make combined_reader public
  tests/sstables: add test for fast forwarding reader
  tests: add more helpers to mutation reader assertions
  sstables: enable fast forwarding for range readers
  mutation_reader: introduce fast_forward_to()
  sstables: implement mutation_reader::impl::fast_forward_to()
  sstables: introduce index_reader
  ...
2016-10-19 18:10:44 +03:00
Paweł Dziepak
ab0eeae82d sstables: keep separate stream history for single and range reads
Single partition and partition range reads are expected to behave
considerably differently, so it is worth having them use separate file
stream histories. This also makes reads use a different history for each
sstable, which is also a good thing.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
20bfa1fa52 sstables: drop sstable::{lower, upper}_bound()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ff699e09f row_cache: rework cache to use fast forwarding reader
This uncomfortably large patch overhauls cache range reader so that it
can take advantage of fast forwarding mutation readers.

A significant change in the cache itself is that the continuity flag now
is used to determine whether cache is contiguous between the previous
entry and the current one. This allows for a significant simplification
of the cache code and easier integration with reader fast forwarding.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
18acb0c0e6 row_cache: put cache entry flags in a struct
Flags are easier to manage if they are in a single structure.
In particular, default initialization and move constructors are simpler
and less error prone.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f248e23db5 row_cache: add do_find_or_create_entry() to reduce code duplication
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
bcd374c05d mutation_reader: forward fast_forward_to() calls
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0c24bbe639 tests/row_cache: add fast_forward_to() to throttled reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
69645455f3 tests/row_cache: count mutations read from _underlying
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether the
continuity flag is working correctly.

This is not going to work with fast forwarding mutation readers, so the
test is switched to counting the number of mutations (+ end of stream
markers) returned from the underlying mutation readers, which is much
less fragile.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
e14f8027d5 memtable: add support for fast_forward_to()
Fast forwarding of memtable readers is needed only for unit tests which
often use memtables as underlying data source for cache and the cache
readers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers weren't used since introduction of continuity flag to cache
entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ac9babe97 tests/mutation_reader: test fast forwarding combined reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7bebfb851f database: enable fast forwarding of range_sstable_reader
When fast forwarding a reader that combines sstable readers we must also
remember that the set of sstables for the new range may be different
from that of the previous one. The reader introduced in this patch makes
sure that we read from the correct sstables.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
b7b7b2bd63 combined_mutation_reader: implement fast_forward_to()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2c0cdd55fc mutation_reader: make combined_reader public
We want to be able to fast forward sstable readers. However, just
implementing fast_forward_to() for combined_reader is not enough as the
sstables we are reading from may need to change.

Following patches are going to introduce a combined sstable reader that
derives from combined_reader. To make that possible we first need to
make combined_reader public.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7dcd70124a tests/sstables: add test for fast forwarding reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5534dc2817 tests: add more helpers to mutation reader assertions
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
cf024975fe sstables: enable fast forwarding for range readers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
62c9492d33 mutation_reader: introduce fast_forward_to()
This patch introduces the interface for fast forwarding mutation
readers. The main user of this feature is going to be the cache, which, while
serving a range query, may need to read multiple small ranges from the
sstables to populate itself with the missing entries.

Fast forwarding is an alternative to recreating a reader with a different
range. Its main advantage is that it avoids dropping data that has
already been read.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
c63e88d556 sstables: implement mutation_reader::impl::fast_forward_to()
This patch allows sstable readers to be fast forwarded without making it
necessary to recreate the reader (and dropping all buffers in the
process). It is built on top of index_reader and the ability of
data_consume_context to be fast forwarded.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
a530762277 sstables: introduce index_reader
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they may still be needed (for example to
get the end bound of the range or after fast forwarding the reader).

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f49a9e0d64 sstables: drop unused read_range_rows() overload
That overload was used only by a unit test and violated the guarantee
that the partition range lives until the mutation reader is done.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0bc873ace5 sstables: add fast_forward_to() to continuous_data_consumer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
25b91c51e2 sstables: add data_consume_rows_context::reset()
reset() is going to be used to restore valid state after fast forwarding
the reader.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2124d08b88 sstables: add skip() to compressed_file_data_source
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
54069162f5 Merge "Add test for partition version list consistency after compaction" from Tomek 2016-10-18 11:03:25 +01:00
Tomasz Grabiec
308434f891 tests: memtable: Add test for partition version list consistency after compaction 2016-10-18 11:57:14 +02:00
Tomasz Grabiec
6548132423 lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
ecf85cbffb mutation: Define + operation
It's more convenient to write m1 + m2 in tests than to do more
elaborate constructs with copy constructors and apply().
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
fe387f8ba0 partition_version: Fix corruption of partition_version list
The move constructor of partition_version was not invoking move
constructor of anchorless_list_base_hook. As a result, when
partition_version objects were moved, e.g. during LSA compaction, they
were unlinked from their lists.

This can make readers return invalid data, because not all versions
will be reachable.

It also causes leaks of versions which are not directly attached
to the memtable entry. This will trigger an assertion failure in the LSA
region destructor. This assertion triggers with row cache disabled. With cache
enabled (default) all segments are merged into the cache region, which
currently is not destroyed on shutdown, so this problem would go
unnoticed. With cache disabled, memtable region is destroyed after
memtable is flushed and after all readers stop using that memtable.

Fixes #1753.
Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>
2016-10-18 09:25:38 +01:00
Duarte Nunes
1d45f19c78 create_view_statement: Use cf_properties
This patch uses cf_properties instead to add the missing attributes to
the create_view_statement class.

Fixes #1766

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c58b7e764 unimplemented: Add materialized views
This patch adds the VIEWS element to the cause enum so we can
mark failures due to incomplete support of materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c28ed3dfc schema: Extract default compressor
This patch extracts the definition of the default compressor into the
compression_parameters class, so that the table and view creation
statements don't have to explicitly deal with it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
dc470e6a36 cql3: Extract cf_properties
This patch extracts the cf_properties class, which contains common
attributes of tables and materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:51 +00:00
Takuya ASADA
587d375e19 main: exit with 1 when verify_seastar_io_scheduler() failed
Since we exit the Scylla process in engine().at_exit() using
::_exit(0), scylla always exits with 0 even when
verify_seastar_io_scheduler() throws an exception.

Because of this, systemd mistakenly concludes that scylla-server.service
shut down successfully, so we need to pass the correct exit code to ::_exit() here.

Fixes #1674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>
2016-10-17 13:57:00 +03:00
Avi Kivity
163088c6af Merge seastar upstream
* seastar 207bf3d...ccd8649 (3):
  > Merge "Augment semaphore with non-blocking operations" from Glauber
  > Merge "More dynamic fstream patches" from Paweł
  > Merge "fstream: add dynamic adjustments based on stream history" from Paweł
2016-10-17 12:49:17 +03:00
Avi Kivity
65c27ccf21 bytes_ostream: make max_chunk_size() an inline function
Fixes debug build looking for a variable definition and not finding it.
2016-10-17 11:49:33 +03:00
Avi Kivity
c0a1ad0b77 bytes_ostream: use larger allocations
A 1MB response will require 2000 allocations with the current 512-byte
chunk size.  Increase it exponentially to reduce allocation count for
larger responses (still respecting the upper limit).
Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>
2016-10-16 10:05:48 +01:00
Tomasz Grabiec
d836e8f64b tests: memtable: Add tests for flushing reader
Message-Id: <1476454187-11462-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:11:06 +01:00
Tomasz Grabiec
63784fd921 db: Fix corruption of partition_entry
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, it is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.

Change the accounting code to reuse existing partition_snapshot reference.

Fixes #1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:10:48 +01:00
Paweł Dziepak
d08cffd3c7 lsa: avoid exceptions during segment_zone creation
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.

This patch changes the LSA code so that it doesn't throw when a zone
couldn't be created but just returns a null pointer, which should be
more performant when LSA memory cannot grow any more.

Fixes #1394.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
2016-10-14 11:08:24 +02:00
Amnon Heiman
7829da13b4 scylla_setup: Reorder questions and actions
The expected behaviour in the scylla_setup script is that a question
will be followed by the answer.

For example, after asking if the scylla should be run as a service the
relevant actions will be taken before the following question.

This patch address two such mis-orders:
1. the scylla-housekeeping depends on the scylla-server, but the
setup should first setup the scylla-server service and only then ask
(and install if needed) the scylla-housekeeping.
2. The node_exporter should be placed after the io_setup is done.

Fixes #1739

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>
2016-10-13 18:29:36 +03:00
Pekka Enberg
3b4e6cdc5e abstract_replication_strategy: Fix exception type if class not found
Change abstract_replication_strategy::create_replication_strategy() to
throw exceptions::configuration_error if the replication strategy class
lookup fails, to make sure the error is converted to the correct CQL response.

Fixes #1755

Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>
2016-10-13 17:39:28 +03:00
Tomasz Grabiec
e617bcd8a7 logalloc: disable abort on allocation failure in places in which it is benign
Some places start big expecting allocation failure, then reduce the
requested size. Let's not abort in such cases.

Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>
2016-10-13 10:53:32 +03:00
Avi Kivity
13e9d4c8e3 Merge seastar upstream
* seastar f937fb0...207bf3d (11):
  > Merge "iotune: gracefully exit on predictable exceptions" (Fixes #1623)
  > core/semaphore: Add semaphore_units::release()
  > Merge "prometheus API with grafana uses labels" from Amnon
  > core/thread: Fix stack alloc-dealloc mismatch
  > core/thread: Make jmp_buf_link::yield_at use the same time point as thread_scheduling_group
  > file: support for XFS on older kernels
  > reactor: fix bug when handling EBADF in flush_pending_aio()
  > prometheus CPU should start in 0
  > Collectd: bytes ordering depends on the type
  > tests: Check that backtrace() doesn't corrupt signal mask
  > core/thread: Add stack guards to seastar thread stacks
2016-10-12 23:47:12 +03:00
Avi Kivity
63f053e9b7 storage_proxy: fix mutation reordering with wrapping ranges
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.

Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.

Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.

Fixes #1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
2016-10-12 15:59:16 +02:00
Avi Kivity
1506b06617 Merge "node_exporter service on ubuntu 16" from Amnon
"This series addresses two issues that interfere with running the node_exporter as a service in ubuntu 16.
1. The service file should be packed in the deb file
2. When setting the node_exporter as a service it doesn't need to run as the scylla user"

* 'amnon/node_exporter_ubuntu_v2' of github.com:cloudius-systems/seastar-dev:
  node-exporter service: No need to run as scylla user
  debian package: Include the node_exporter service file
2016-10-12 12:11:18 +03:00
Amnon Heiman
1bd50789e0 node-exporter service: No need to run as scylla user
The node-exporter does not need to run as the scylla user. It can run
without scylla or without the scylla user being configured.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:27 +03:00
Amnon Heiman
d523bf56ed debian package: Include the node_exporter service file
This will include the node_exporter service script for ubuntu
distribution with systemd support.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:14 +03:00
Avi Kivity
f6998bb260 Merge "Implement describe_splits_ex based on Cassandra" from Duarte
"This patch-set re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693"

* 'describe-splits/v2' of github.com:duarten/scylla:
  thrift: Implement describe_splits_ex based on Cassandra
  storage_service: Implement get_splits() function
  sstables: Add function to get key samples
  sstables/key: Add to_partition_key function
  size_estimates_recorder: Increase estimate accuracy
  sstables: Get estimates for a particular range
  sstables/key: Make key::kind public
2016-10-11 11:13:35 +03:00
Takuya ASADA
0007f2d838 dist/common/sbin: add scylla_cpuset_setup and scylla_dev_mode_setup to /usr/sbin
We haven't added symlinks to /usr/sbin for newly created scripts, so add them.

Fixes #1702

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474879711-31793-1-git-send-email-syuu@scylladb.com>
2016-10-11 11:02:14 +03:00
Takuya ASADA
ccad720bb1 dist/common/script/scylla_io_setup: handle comma correctly when parsing cpuset
The script mistakenly split the value at "," when the cpuset list is
separated by commas. Instead of matching possible patterns of the argument,
let's accept all characters until we reach a space delimiter or end of line.

Fixes #1716

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476171037-32373-1-git-send-email-syuu@scylladb.com>
2016-10-11 10:42:32 +03:00
Duarte Nunes
d8cfc56376 thrift: Implement describe_splits_ex based on Cassandra
This patch re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:10 +02:00
Duarte Nunes
01ab2081cd storage_service: Implement get_splits() function
This patch implements the get_splits() function in storage_service,
used to split a particular token range in slices of approximately the
specified size, using the sample keys and estimates of the CF's
sstables.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:08 +02:00
Duarte Nunes
c36dbaf0f1 sstables: Add function to get key samples
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:14 +02:00
Duarte Nunes
fc07b66678 sstables/key: Add to_partition_key function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:11 +02:00
Duarte Nunes
c19c633299 size_estimates_recorder: Increase estimate accuracy
This patch uses the estimated_keys_for_range() function to get better
estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:16 +02:00
Duarte Nunes
ceed09b23e sstables: Get estimates for a particular range
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present within the specified range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:15 +02:00
Duarte Nunes
8c223b31c8 sstables/key: Make key::kind public
Needed to create synthetic keys without any value but with ordering
properties.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:47:24 +02:00
Avi Kivity
b305d92a65 Merge "housekeeping: check version during setup" from Amnon
"The version is taken from the installation rather than the API, a mode
command-line flag indicates that this is part of the setup, and the uuid is
used for the interaction with the checkversion server."

* 'amnon/check_version_on_startup_v3' of github.com:cloudius-systems/seastar-dev:
  scylla_setup: Check and report the scylla version
  scylla-housekeeping: check version during setup
2016-10-10 16:37:14 +03:00
Vlad Zolotarov
ab748e829d docs: tracing.md: initial commit
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475686745-20383-1-git-send-email-vladz@cloudius-systems.com>
2016-10-10 16:12:02 +03:00
Tomasz Grabiec
4357d0a6d9 db: Add counter for writes blocked on dirty memory
There is already queue_length-requests_blocked_memory, but it's a
gauge, so it does not reflect what happened between the sampling points.

total_operations-requests_blocked_memory will allow us to see if there
were any (and how many) requests blocked by dirty memory.

Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
2016-10-10 14:25:22 +03:00
Pekka Enberg
3b75ff1496 docs/docker: Tag --listen-address as 1.4 feature
The Docker Hub documentation is the same for all image versions. Tag
`--listen-address` as 1.4 feature.

Message-Id: <1475819164-7865-1-git-send-email-penberg@scylladb.com>
2016-10-10 13:26:16 +03:00
Vlad Zolotarov
006999f46c api::storage_service::slow_query: don't use duration_cast in GET
The slow_query_record_ttl() and slow_query_threshold() already return a
duration of the appropriate type - no need for an additional cast.
In addition, there was a mistake in the cast of ttl.

Fixes #1734

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475669400-5925-1-git-send-email-vladz@cloudius-systems.com>
2016-10-09 18:09:13 +03:00
Takuya ASADA
469e9af1f4 dist/common/scripts/scylla_setup: use 'swapon -s' instead of 'swapon --show'
Since Ubuntu 14.04 doesn't support the --show option, we need to avoid using it.
Fixes #1740

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-2-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Takuya ASADA
8452045b85 dist/ubuntu: add realpath to dependency, requires for scylla_setup
We need a dependency on realpath, since scylla_setup uses it.

Fixes #1740.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-1-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Tomasz Grabiec
41e66ebce2 gdb: Introduce 'scylla heapprof'
Presents current heap profile recording.

Works in text mode or dumps to collapsed stacks format from which
flame graph can be generated.

To generate a flamegraph:

  (gdb) scylla heapprof --flame
  Wrote heapprof.stacks

  $ flamegraph.pl --colors mem < heapprof.stacks > heapprof.svg

flamegraph.pl comes from:

  https://github.com/brendangregg/FlameGraph.git

Text mode example:

(gdb) scylla heapprof --min 100000000
All (274699676, #10213)
 \-- void* memory::cpu_pages::allocate_large_and_trim<memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}>(unsigned int, memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}) + 169  (268435456, #1)
     memory::allocate_large_aligned(unsigned long, unsigned long) + 87
     memory::allocate_aligned(unsigned long, unsigned long) + 48
     aligned_alloc + 9
     logalloc::segment_zone::segment_zone() + 304
     logalloc::segment_pool::allocate_segment() + 477
     logalloc::segment_pool::segment_pool() + 304
     __tls_init.part.801 + 72
     logalloc::region_group::release_requests() + 1333
     logalloc::region_group::add(logalloc::region_group*) + 514

The branches are formatted like this:

   -- <symbol> (<size>, #<count>)

Where <size> is total size of live objects and <count> is total
number of live objects, for all objects allocated from paths going
through this node.

Nodes which share the same <size> and <count> are stacked like this:

   -- <symbol_1> (<size>, #<count>)
      <symbol_2>
      <symbol_3>

Message-Id: <1475583334-19524-1-git-send-email-tgrabiec@scylladb.com>
2016-10-09 10:54:08 +03:00
Glauber Costa
33e9c2bbdd memtable: reduce sstable flush concurrency to one
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write behind support. Now
that write behind is properly merged we can reduce the concurrency to
what it should be, one.

This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.

Fixes #1373

Signed-off-by: Glauber Costa <glauber@scylladb.com>

Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
2016-10-09 10:48:57 +03:00
Tomasz Grabiec
2a5a90f391 db: Do not timeout streaming readers
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.

This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.

The change sets no timeout for streaming readers at replica level,
similarly as we do for system tables readers.

Fixes #1741.

Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
2016-10-07 15:41:04 +03:00
Raphael S. Carvalho
9175977a9d cql3: fix build failure by defining out unused function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <cba6207278ea945ee750d78b189320443843a288.1475793747.git.raphaelsc@scylladb.com>
2016-10-07 08:45:18 +03:00
Avi Kivity
9ac441d3b5 range: adjust split_after to allow split_point outside input range
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range.  If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.

"before" and "after" are not well defined for wrap-around ranges,
but we are phasing those out and soon there will be no
wrapping_range::split_after() users.

This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
2016-10-06 17:54:44 +02:00
Raphael S. Carvalho
7ea4513595 database: trigger compaction after loading new sstables
Scylla wasn't trying to compact new sstables uploaded via 'nodetool
refresh'. Thus, all new sstables were left uncompacted until the user
issued 'nodetool flush' or a new sstable was written, which would
trigger compaction too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:49 +03:00
Raphael S. Carvalho
9c59ccc52a storage_service: improve log message for refresh
'No new SSTables were found for keyspace1.standard1' was printed
even if the user uploaded new sstables to the upload dir, which is
confusing. We should instead print that message only if new sstables
weren't found in either the cf or the cf/upload dirs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <90386f6255407697434213227ae7ff0de7464f99.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:32 +03:00
Raphael S. Carvalho
76862d0d9c main: start compaction procedure after commit log is replayed
Commit log replay is a synchronous operation in bootstrap, so services
will only be started after it's completed. Starting compaction before that
leaves less bandwidth for both, and consequently boot is slowed down. The
fix simply moves compaction, which is an asynchronous operation, to after
commitlog replay is over.

Fixes #1620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>
2016-10-06 18:25:24 +03:00
Nadav Har'El
ee7ec10b11 CQL parser: "CREATE MATERIALIZED VIEW" statement
This patch adds the parsing for the "CREATE MATERIALIZED VIEW" statement,
following Cassandra 3 syntax. For example:

   CREATE MATERIALIZED VIEW building_by_city
   AS SELECT * FROM buildings
   WHERE city IS NOT NULL
   PRIMARY KEY(city, name);

It also adds the "IS NOT NULL" operator needed for this purpose.
As in Cassandra, "IS NOT NULL" can only be used for materialized
view creation, and not in a normal SELECT. It can only be used with
the NULL operand (i.e., "IS NOT 3" will be a syntax error).

The current implementation of this statement just does some sanity
checks (such as verifying that "city" is a valid column name and that
the "buildings" base table exists), and then complains that materialized
views are not yet supported:

SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [CREATE MATERIALIZED VIEW building_by_city AS
SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);] reason: unsupported operation: Materialized views not yet supported">

As mentioned above, the "IS NOT NULL" restriction is not allowed in
ordinary selects that do not create a materialized view:

SELECT * FROM buildings WHERE city IS NOT NULL;
InvalidRequest: code=2200 [Invalid query] message="restriction 'city IS NOT null' is only supported in materialized view creation"

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1475742927-30695-1-git-send-email-nyh@scylladb.com>
2016-10-06 15:42:37 +03:00
Glauber Costa
7146776d7c fix sstable tests by not using the flush_reader if no region_group
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to them.

Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.

If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.

One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.

The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
2016-10-05 12:44:21 +01:00
Avi Kivity
c94fb1bf12 build: reduce inclusions of messaging_service.hh
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.

Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
2016-10-05 11:46:49 +03:00
Avi Kivity
f8118d9fc2 Merge "Virtual dirty memory management" from Glauber
"Description:
============

Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that, is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Results
=======

With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.

In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.

For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ

Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed.  I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have run (large, 65k mutations). Memtables under this circumstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.

Known Issues:
-------------

A lot of time can elapse between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.

After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."

* 'virtual-dirty-v6' of github.com:glommer/scylla:
  database: allow virtual dirty memory management
  streamed_mutation: make _buffer private
  add accounting of memory read to partition_snapshot_reader
  move partition_snapshot_reader code to header file
  LSA: allow a group to query its own region group
  memtables: split scanning reader in two
  sstables: use special reader for writing a memtable
  LSA: export information about object memory footprint
  LSA: export information about size of the throttle queue
  database: export virtual dirty bytes region group
2016-10-04 20:57:52 +03:00
Avi Kivity
cc33c8b4ba Merge seastar upstream
* seastar 18f7bb8...f937fb0 (5):
  > Merge "Fix signal mask corruption" from Tomasz
  > core/memory: Avoid violating strict aliasing when accessing allocation sites
  > core/memory: Avoid indirection when storing allocation sites
  > core/memory: Add a way to disable abort on allocation failure in some scope
  > core/sharded: Allow mapper to take the service by non-const reference
2016-10-04 20:08:57 +03:00
Glauber Costa
f89a67c75c database: allow virtual dirty memory management
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
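The accounting scheme described above can be sketched as a small tracker class. This is a hedged illustration of the commit message, not Scylla's actual region_group API: the class name `dirty_tracker`, its methods, and the hard-coded 50 % soft limit are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative sketch of "virtual dirty" accounting (hypothetical names,
// not the real Scylla API). New writes are admitted only while virtual
// dirty memory is below a soft limit of 50% of the hard limit; memory
// written out by a flush is subtracted immediately ("virtually freed")
// instead of waiting for the whole SSTable to be sealed.
class dirty_tracker {
    size_t _hard_limit;
    size_t _real_dirty = 0;     // bytes held by memtables
    size_t _virtual_freed = 0;  // bytes already written out by a flush
public:
    explicit dirty_tracker(size_t hard_limit) : _hard_limit(hard_limit) {}
    size_t virtual_dirty() const { return _real_dirty - _virtual_freed; }
    // Admit new requests only below the soft (50%) limit.
    bool can_admit() const { return virtual_dirty() < _hard_limit / 2; }
    void add_dirty(size_t n) { _real_dirty += n; }
    // Called incrementally as the flush writer makes progress,
    // long before the SSTable is sealed.
    void account_flushed(size_t n) { _virtual_freed += n; }
    // Called when the SSTable is sealed and the memtable is dropped.
    void release(size_t n) {
        _real_dirty -= n;
        _virtual_freed -= std::min(_virtual_freed, n);
    }
};
```

With this shape, requests start flowing again as soon as some memory has been written back, rather than only when a flush fully completes.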

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
7b6e8a2526 streamed_mutation: make _buffer private
It is currently protected, but now all users go through
push_mutation_fragment(), so we can safely make it private to guarantee
that it stays that way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
1db245b52d add accounting of memory read to partition_snapshot_reader
By default, we don't do any accounting. By specializing this class and providing
an accounter class, we can account for how much memory we are reading as we read
through the elements.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
452eb95943 move partition_snapshot_reader code to header file
This is so we can template it without worrying about declaring the
specializations in the .cc file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
86aa0b830d LSA: allow a group to query its own region group
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
eee15578fb memtables: split scanning reader in two
The code that is common will live in its own reader, the iterator_reader.  All
friendly private access to memtable attributes and methods happens through the
iterator_reader.

After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
16886eeb96 sstables: use special reader for writing a memtable
Right now the special reader doesn't do much, but the idea is that we will
soon replace it with a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
28e3f2f6ee LSA: export information about object memory footprint
We allocate objects of a certain size, but we use a bit more memory to hold
them.  To get a clearer picture of how much memory an object will cost us, we
need help from the allocator. This patch exports an interface that allows users
to query into a specific allocator to get that information.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Pekka Enberg
c3bebea1ef dist/docker: Add '--listen-address' to 'docker run'
Add a '--listen-address' command line parameter to the Docker image,
which can be used to set Scylla's listen address.

Refs #1723

Message-Id: <1475485165-6772-1-git-send-email-penberg@scylladb.com>
2016-10-04 13:57:55 +03:00
Marius
876775a52c dist/docker/ubuntu: refactored $IP/listen_address
In order to allow Scylla’s docker container to handle multiple network
interfaces, the start-scylla script was refactored:

- `$IP` is now called `$SCYLLA_LISTEN_ADDRESS`, so it is less likely to
   be confused or interfere with other environment variables.
- `$SCYLLA_LISTEN_ADDRESS` now checks its value and also tries to
   resolve a hostname if no IP address was set.
- `$SCYLLA_LISTEN_DEVICE` can now be set as environment variable and
   contain any available NIC device name (e.g. `eth0`). The script
   automatically retrieves the IP address from the device.

Usage:

1. With `$SCYLLA_LISTEN_ADDRESS` as IP:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=192.168.1.100 scylladb/scylla`

2. With `$SCYLLA_LISTEN_ADDRESS` as hostname:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=containername.network.lan scylladb/scylla`

3. With `$SCYLLA_LISTEN_DEVICE`:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_DEVICE=eth0 scylladb/scylla`

Message-Id: <20161003151230.67672-1-marius@twostairs.com>
2016-10-04 13:56:55 +03:00
Raphael S. Carvalho
747b42299c database: remove unused code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>
2016-10-04 09:26:43 +03:00
Paweł Dziepak
7599ef6fde query_pager: fix splitting range at the end bound
Currently, the code responsible for calculating ranges for the next
request could produce a wrap-around partition range. For example, if the
original range was (unimportant, A] and the last partition key was A, then
the output range would be (A, A].

This patch adds checks to make sure that in such cases the range is
removed.
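The check described above can be sketched as follows. This is an illustrative model only: the `range` struct, integer keys, and the `trim_after` helper are assumptions for the example, not the actual query_pager code.

```cpp
#include <optional>

// Illustrative sketch (hypothetical types, not the real query_pager code).
// After a page, the remaining range starts exclusively after the last
// partition key seen. If the original range was (start, A] and the last
// key was A, the naive result (A, A] would wrap around; it is really an
// empty range and must be dropped.
struct range {
    int start; bool start_inclusive;
    int end;   bool end_inclusive;
};

std::optional<range> trim_after(const range& r, int last_key) {
    range out{last_key, /*start_inclusive=*/false, r.end, r.end_inclusive};
    // (A, A] is empty, not "everything except A": drop it.
    if (out.start == out.end && !out.start_inclusive) {
        return std::nullopt;
    }
    return out;
}
```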

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475497244-2790-1-git-send-email-pdziepak@scylladb.com>
2016-10-03 19:33:42 +02:00
Avi Kivity
8747054d10 exceptions: mark function called before construction static
cassandra_exception::prepare_message() is called from derived classes'
constructors before the base cassandra_exception object is constructed.
This is technically illegal but harmless.  Fix by marking the function
static.
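The pattern being fixed can be shown with a minimal example (simplified, hypothetical names; the real class is cassandra_exception). A derived constructor evaluates an argument for the base constructor before the base subobject exists, so calling a non-static member there is technically illegal; a static member needs no object and is safe.

```cpp
#include <string>

struct base_exception {
    std::string msg;
    explicit base_exception(std::string m) : msg(std::move(m)) {}
    // static: callable before any base_exception object exists.
    static std::string prepare_message(const char* what) {
        return std::string("error: ") + what;
    }
};

struct derived_exception : base_exception {
    // prepare_message() runs while building the argument for the base
    // constructor, i.e. before the base subobject is constructed. As a
    // static member it does not touch the not-yet-constructed object.
    derived_exception() : base_exception(prepare_message("derived")) {}
};
```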

Found by clang.
2016-10-03 16:29:02 +03:00
Calle Wilund
5b815b81b4 auth::password_authenticator: Ensure exceptions are processed in continuation
Fixes #1718 (even more)
Message-Id: <1475497389-27016-1-git-send-email-calle@scylladb.com>
2016-10-03 14:49:59 +02:00
Pekka Enberg
f3cd21c8f1 Merge seastar upstream
* seastar 0e60722...18f7bb8 (1):
  > core/memory: Fix compilation errors
2016-10-03 12:54:38 +03:00
Calle Wilund
d24d0f8f90 auth::password_authenticator: "authenticate" should not throw undeclared excpt
Fixes #1718

Message-Id: <1475487331-25927-1-git-send-email-calle@scylladb.com>
2016-10-03 12:53:30 +03:00
Avi Kivity
a51804eca8 Merge "token_restriction: Deal with minimum tokens" from Duarte
"This patch set ensures we can correctly handle queries
where the minimum token is specified."

* 'min-token/v3' of github.com:duarten/scylla:
  cql_query_test: Add test case for min/max token bounds
  token_restriction: Deal with minimum tokens
  partitioner: Parse token from bytes
2016-10-02 12:32:40 +03:00
Avi Kivity
5071f4c0bf Merge seastar upstream
* seastar 9e1d5db...0e60722 (9):
  > core/memory: Replace assert with bad_alloc in allocate_large()
  > chunked_fifo: avoid direct use of sized operator delete
  > memory: fix build without heap profiler
  > xen: initialize port::_sem
  > Merge "Make input streams skippable" from Paweł
  > semaphore: require explict setting for start value
  > prometheus: remove invalid chars from meric names
  > core/memory: Introduce heap profiler
  > util/backtrace: Mark noexcept if func() doesn't throw
2016-10-02 11:43:22 +03:00
Vlad Zolotarov
7e180c7bd3 tracing: introduce the tracing::global_trace_state_ptr class
This object, similarly to a global_schema_ptr, allows one to dynamically
create trace_state_ptr objects on different shards in the context
of the original tracing session.

This object would create a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.

Fixes #1678
Fixes #1647

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
2016-10-02 11:31:37 +03:00
Amnon Heiman
a83bd900be scylla_setup: Check and report the scylla version
This patch adds a call to the scylla-housekeeping check version during
setup, so a warning will be printed if a newer version is available.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Amnon Heiman
5e3ab32365 scylla-housekeeping: check version during setup
These changes are for running scylla-housekeeping during setup.

It contains the following changes:
1. Get the current version from the command line (as scylla does not
run at this stage).
2. Support a mode parameter in the command line to indicate that we are
running during the installation.
3. Accept an external uuid that will be used in all interactions
with the check_version server.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Takuya ASADA
15b156c9d4 dist/common/scripts/scylla_io_setup: describe how to set developer mode when validation tests failed
Describe how to set developer mode, so as not to confuse users.
Fixes #1701

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475167584-18092-1-git-send-email-syuu@scylladb.com>
2016-10-02 10:58:38 +03:00
Avi Kivity
58ddfea18f Merge "Fixes for leveled compaction strategy" from Raphael
* 'lcs_fixes' of github.com:raphaelsc/scylla:
  lcs: fix starvation at higher levels
  lcs: fix broken token range distribution at higher levels
2016-10-01 21:34:21 +03:00
Takuya ASADA
9639cc840e dist/redhat: add missing build time dependency for libunwind
There was a missing dependency on libunwind, so add it.
Fixes #1722

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475260099-25881-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:39 +03:00
Takuya ASADA
c89d9599b1 dist/ubuntu: add missing build time dependency for libunwind
There was a missing dependency on libunwind, so add it.
Fixes #1721

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475255706-26434-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:21 +03:00
Raphael S. Carvalho
a8ab4b8f37 lcs: fix starvation at higher levels
When the max sstable size is increased, higher levels suffer from
starvation, because we decide to compact a given level only if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)
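The score check can be sketched as below. This is an illustrative model of the description above, not the actual compaction code: the 10x level growth factor and the top-down scan (so that higher levels are not perpetually shadowed by lower ones) are assumptions for the example.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the level-score check (hypothetical helpers).
// A level is a compaction candidate when
//   level_size(L) / max_size_for_level(L) > 1.001.
// With a larger max sstable size, higher levels rarely cross that ratio,
// so they must be considered from the top down or they starve.
static uint64_t max_size_for_level(int level, uint64_t max_sstable_size) {
    uint64_t size = max_sstable_size;
    for (int i = 1; i < level; ++i) {
        size *= 10;  // assume each level is 10x larger than the previous
    }
    return size;
}

static int choose_level(const std::vector<uint64_t>& level_sizes,
                        uint64_t max_sstable_size) {
    for (int level = int(level_sizes.size()) - 1; level >= 1; --level) {
        double score = double(level_sizes[level]) /
                       double(max_size_for_level(level, max_sstable_size));
        if (score > 1.001) {
            return level;
        }
    }
    return -1;  // no level currently needs compaction
}
```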

Fixes #1720.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:49 -03:00
Raphael S. Carvalho
a3bf7558f2 lcs: fix broken token range distribution at higher levels
Uniform token range distribution across sstables in levels > 1 was broken,
because we were only choosing the sstable with the lowest first key when
compacting a level > 0. This resulted in a performance problem, because
L1->L2 compactions may build up a huge overlap over time, for example.
The last compacted key will now be stored for each level to ensure a sort of
"round robin" selection of sstables for compactions at levels >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.

Fixes #1719.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:16 -03:00
Paweł Dziepak
eb1fcf3ecc query_pagers: fix clustering key range calculation
Paging code assumes that the clustering row range [a, a] contains only one
row, which may not be true. Another problem is that it tries to use the
range<> interface for dealing with clustering key ranges, which doesn't
work because of the lack of a correct comparator.

Refs #1446.
Fixes #1684.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
2016-09-30 17:32:59 +02:00
Tomasz Grabiec
7e25b958ac transport: Extend request memory footprint accounting to also cover execution
CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for a
request's memory only around the reading of its frame from the connection,
and not during actual request execution. As a result, too many requests may
be allowed to execute and we may run out of memory.
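The extended accounting can be sketched with an RAII guard whose lifetime spans both frame parsing and execution. This is a hedged illustration of the idea, not the actual transport code: `memory_limiter` and `units` are hypothetical names, and a real server would make callers wait instead of merely checking the budget.

```cpp
#include <cstddef>

// Illustrative sketch: units are taken when a request's frame is read
// and, via RAII, are only returned when execution finishes, so requests
// in flight between parsing and execution still count against the budget.
class memory_limiter {
    size_t _available;
public:
    explicit memory_limiter(size_t budget) : _available(budget) {}
    size_t available() const { return _available; }
    bool try_reserve(size_t n) const { return n <= _available; }

    class units {
        memory_limiter* _ml;
        size_t _n;
    public:
        units(memory_limiter& ml, size_t n) : _ml(&ml), _n(n) {
            ml._available -= n;  // reserve for the request's lifetime
        }
        units(const units&) = delete;
        units& operator=(const units&) = delete;
        ~units() { _ml->_available += _n; }  // released after execution
    };
};
```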

Fixes #1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
2016-09-30 14:23:14 +01:00
Duarte Nunes
72af476397 cql_query_test: Add test case for min/max token bounds
This patch adds a test case for specifying the minimum and maximum
tokens in a cql3 query.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:45:45 +00:00
Duarte Nunes
98b4814894 token_restriction: Deal with minimum tokens
This patch fixes a bug where queries such as the following are not
handled properly:

"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"

Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.
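The fix's key idea can be sketched as a parsing step that special-cases the smallest 64-bit value. This is an illustrative model of the description above (the `token` struct and `token_from_value` are hypothetical simplifications of the i_partitioner interface):

```cpp
#include <cstdint>
#include <limits>

// Illustrative sketch: the smallest int64 value must map to the special
// "minimum token", which sorts before every key token, so a restriction
// like "token(id) <= minimum" matches nothing instead of being treated
// as an ordinary key token and returning incorrect results.
struct token {
    enum class kind { before_all_keys, key } k;
    int64_t value;
};

token token_from_value(int64_t v) {
    if (v == std::numeric_limits<int64_t>::min()) {
        return {token::kind::before_all_keys, 0};
    }
    return {token::kind::key, v};
}
```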

Ref #1139
Ref #693
Fixes #1717

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:08 +00:00
Duarte Nunes
862f51cddf partitioner: Parse token from bytes
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is to parse a particular token and explicitly handle the
case when the minimum token is specified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:02 +00:00
Duarte Nunes
0c8f280af7 partition_key_view: Implement operator<<
The operator is declared, but it isn't implemented. This patch fixes
that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475225647-3800-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:54 +02:00
Duarte Nunes
a36888f3cb storage_service: Convert token through partitioner
This patch ensures we use the partitioner to convert a token to
sstring instead of casting.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475179683-28552-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:26 +02:00
Glauber Costa
f5fd6bd714 LSA: export information about size of the throttle queue
Also add information about how long the oldest request has been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.

This counter is named for consistency after similar counters from transport/.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
Glauber Costa
aa6a96d09b database: export virtual dirty bytes region group
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".

We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
392 changed files with 12618 additions and 6450 deletions

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

11
CONTRIBUTING.md Normal file
View File

@@ -0,0 +1,11 @@
# Asking questions or requesting help
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.
# Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
# Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.

View File

@@ -83,14 +83,6 @@ Run the image with:
docker run -p $(hostname -i):9042:9042 -i -t <image name>
```
## Contributing to Scylla
Do not send pull requests.
Send patches to the mailing list address scylladb-dev@googlegroups.com.
Be sure to subscribe.
In order for your patches to be merged, you must sign the Contributor's
License Agreement, protecting your rights and ours. See
http://www.scylladb.com/opensource/cla/.
[Guidelines for contributing](CONTRIBUTING.md)

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=1.6.6
if test -f version
then

View File

@@ -397,6 +397,36 @@
}
]
},
{
"path": "/cache_service/metrics/key/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/size",
"operations": [
@@ -607,6 +637,36 @@
}
]
},
{
"path": "/cache_service/metrics/counter/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/size",
"operations": [

View File

@@ -78,11 +78,19 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",
"required":false,
"allowMultiple":false,
"type":"bool",
"paramType":"query"
}
]
}
@@ -102,7 +110,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -129,7 +137,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -153,7 +161,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -180,7 +188,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -204,7 +212,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -244,7 +252,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -271,7 +279,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -298,7 +306,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -317,7 +325,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -349,7 +357,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -381,7 +389,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -405,7 +413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -432,7 +440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -459,7 +467,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -491,7 +499,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -518,7 +526,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -545,7 +553,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -569,7 +577,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -593,7 +601,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -633,7 +641,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -673,7 +681,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -713,7 +721,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -753,7 +761,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -793,7 +801,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -833,7 +841,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -873,7 +881,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -916,7 +924,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -943,7 +951,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -970,7 +978,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -994,7 +1002,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1034,7 +1042,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1058,7 +1066,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1101,7 +1109,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1144,7 +1152,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1203,7 +1211,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1243,7 +1251,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1267,7 +1275,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1310,7 +1318,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1353,7 +1361,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1412,7 +1420,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1452,7 +1460,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1492,7 +1500,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1532,7 +1540,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1572,7 +1580,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1612,7 +1620,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1652,7 +1660,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1692,7 +1700,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1732,7 +1740,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1772,7 +1780,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1812,7 +1820,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1852,7 +1860,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1892,7 +1900,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1932,7 +1940,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1972,7 +1980,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2012,7 +2020,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2052,7 +2060,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2092,7 +2100,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2116,7 +2124,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2156,7 +2164,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2196,7 +2204,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2236,7 +2244,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2276,7 +2284,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2300,7 +2308,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2324,7 +2332,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2351,7 +2359,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2378,7 +2386,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2405,7 +2413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2432,7 +2440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2501,7 +2509,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2525,7 +2533,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2549,7 +2557,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2573,7 +2581,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2597,7 +2605,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2621,7 +2629,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2645,7 +2653,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2669,7 +2677,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2693,7 +2701,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2717,7 +2725,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2741,7 +2749,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2765,7 +2773,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",


@@ -21,8 +21,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
@@ -45,8 +45,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"


@@ -42,6 +42,25 @@
}
]
},
{
"path":"/failure_detector/endpoint_phi_values",
"operations":[
{
"method":"GET",
"summary":"Get end point phi values",
"type":"array",
"items":{
"type":"endpoint_phi_values"
},
"nickname":"get_endpoint_phi_values",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/failure_detector/endpoints/",
"operations":[
@@ -202,6 +221,20 @@
"description": "The application state version"
}
}
},
"endpoint_phi_value": {
"id" : "endpoint_phi_value",
"description": "Holds phi value for a single end point",
"properties": {
"phi": {
"type": "double",
"description": "Phi value"
},
"endpoint": {
"type": "string",
"description": "end point address"
}
}
}
}
}


@@ -777,7 +777,7 @@
]
},
{
"path": "/storage_proxy/metrics/read/moving_avrage_histogram",
"path": "/storage_proxy/metrics/read/moving_average_histogram",
"operations": [
{
"method": "GET",
@@ -792,7 +792,7 @@
]
},
{
"path": "/storage_proxy/metrics/range/moving_avrage_histogram",
"path": "/storage_proxy/metrics/range/moving_average_histogram",
"operations": [
{
"method": "GET",
@@ -942,7 +942,7 @@
]
},
{
"path": "/storage_proxy/metrics/write/moving_avrage_histogram",
"path": "/storage_proxy/metrics/write/moving_average_histogram",
"operations": [
{
"method": "GET",


@@ -1201,11 +1201,12 @@
],
"parameters":[
{
"name":"non_system",
"description":"When set to true limit to non system",
"name":"type",
"description":"Which keyspaces to return",
"required":false,
"allowMultiple":false,
"type":"boolean",
"type":"string",
"enum": [ "all", "user", "non_local_strategy" ],
"paramType":"query"
}
]


@@ -166,33 +166,36 @@ inline int64_t max_int64(int64_t a, int64_t b) {
* It combines the total and the subset for the ratio; its
* to_json method returns the ratio sub/total
*/
struct ratio_holder : public json::jsonable {
double total = 0;
double sub = 0;
template<typename T>
struct basic_ratio_holder : public json::jsonable {
T total = 0;
T sub = 0;
virtual std::string to_json() const {
if (total == 0) {
return "0";
}
return std::to_string(sub/total);
}
ratio_holder() = default;
ratio_holder& add(double _total, double _sub) {
basic_ratio_holder() = default;
basic_ratio_holder& add(T _total, T _sub) {
total += _total;
sub += _sub;
return *this;
}
ratio_holder(double _total, double _sub) {
basic_ratio_holder(T _total, T _sub) {
total = _total;
sub = _sub;
}
ratio_holder& operator+=(const ratio_holder& a) {
basic_ratio_holder<T>& operator+=(const basic_ratio_holder<T>& a) {
return add(a.total, a.sub);
}
friend ratio_holder operator+(ratio_holder a, const ratio_holder& b) {
friend basic_ratio_holder<T> operator+(basic_ratio_holder a, const basic_ratio_holder<T>& b) {
return a += b;
}
};
typedef basic_ratio_holder<double> ratio_holder;
typedef basic_ratio_holder<int64_t> integral_ratio_holder;
class unimplemented_exception : public base_exception {
public:


@@ -177,6 +177,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_key_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME
@@ -194,7 +208,7 @@ void set_cache_service(http_context& ctx, routes& r) {
});
cs::get_row_capacity.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {
return cf.get_row_cache().get_cache_tracker().region().occupancy().used_space();
}, std::plus<uint64_t>());
});
@@ -280,6 +294,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_counter_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME


@@ -191,8 +191,8 @@ static double update_ratio(double acc, double f, double total) {
return acc;
}
static ratio_holder mean_row_size(column_family& cf) {
ratio_holder res;
static integral_ratio_holder mean_row_size(column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
auto c = i->get_stats_metadata().estimated_row_size.count();
res.sub += i->get_stats_metadata().estimated_row_size.mean() * c;
@@ -562,11 +562,13 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, req->param["name"], integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_all_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {


@@ -22,16 +22,22 @@
#include "locator/snitch_base.hh"
#include "endpoint_snitch.hh"
#include "api/api-doc/endpoint_snitch_info.json.hh"
#include "utils/fb_utilities.hh"
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r) {
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(req.get_query_param("host"));
static auto host_or_broadcast = [](const_req req) {
auto host = req.get_query_param("host");
return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);
};
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_rack.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(req.get_query_param("host"));
httpd::endpoint_snitch_info_json::get_rack.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {


@@ -88,6 +88,20 @@ void set_failure_detector(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(state);
});
});
fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {
return gms::get_arrival_samples().then([](std::map<gms::inet_address, gms::arrival_window> map) {
std::vector<fd::endpoint_phi_value> res;
auto now = gms::arrival_window::clk::now();
for (auto& p : map) {
fd::endpoint_phi_value val;
val.endpoint = p.first.to_sstring();
val.phi = p.second.phi(now);
res.emplace_back(std::move(val));
}
return make_ready_future<json::json_return_type>(res);
});
});
}
}


@@ -22,6 +22,8 @@
#include "storage_service.hh"
#include "api/api-doc/storage_service.json.hh"
#include "db/config.hh"
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <service/storage_service.hh>
#include <db/commitlog/commitlog.hh>
#include <gms/gossiper.hh>
@@ -457,8 +459,15 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_keyspaces.set(r, [&ctx](const_req req) {
auto non_system = req.get_query_param("non_system");
return map_keys(ctx.db.local().keyspaces());
auto type = req.get_query_param("type");
if (type == "user") {
return ctx.db.local().get_non_system_keyspaces();
} else if (type == "non_local_strategy") {
return map_keys(ctx.db.local().get_keyspaces() | boost::adaptors::filtered([](const auto& p) {
return p.second.get_replication_strategy().get_type() != locator::replication_strategy_type::local;
}));
}
return map_keys(ctx.db.local().get_keyspaces());
});
ss::update_snitch.set(r, [](std::unique_ptr<request> req) {
@@ -684,8 +693,8 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::get_slow_query_info.set(r, [](const_req req) {
ss::slow_query_info res;
res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();
res.ttl = std::chrono::duration_cast<std::chrono::microseconds>(tracing::tracing::get_local_tracing_instance().slow_query_record_ttl()).count() ;
res.threshold = std::chrono::duration_cast<std::chrono::microseconds>(tracing::tracing::get_local_tracing_instance().slow_query_threshold()).count();
res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;
res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();
return res;
});


@@ -63,8 +63,8 @@ public:
::feed_hash(as_collection_mutation(), h, def.type);
}
}
size_t memory_usage() const {
return _data.memory_usage();
size_t external_memory_usage() const {
return _data.external_memory_usage();
}
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};


@@ -243,7 +243,8 @@ future<> auth::auth::setup() {
std::map<sstring, sstring> opts;
opts["replication_factor"] = "1";
auto ksm = keyspace_metadata::new_keyspace(AUTH_KS, "org.apache.cassandra.locator.SimpleStrategy", opts, true);
f = service::get_local_migration_manager().announce_new_keyspace(ksm, false);
// We use min_timestamp so that default keyspace metadata will lose to any manual adjustments. See issue #2129.
f = service::get_local_migration_manager().announce_new_keyspace(ksm, api::min_timestamp, false);
}
return f.then([] {
@@ -353,7 +354,7 @@ future<> auth::auth::setup_table(const sstring& name, const sstring& cql) {
parsed->prepare_keyspace(AUTH_KS);
::shared_ptr<cql3::statements::create_table_statement> statement =
static_pointer_cast<cql3::statements::create_table_statement>(
parsed->prepare(db)->statement);
parsed->prepare(db, qp.get_cql_stats())->statement);
auto schema = statement->get_cf_meta_data();
auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());


@@ -47,11 +47,8 @@
const sstring auth::data_resource::ROOT_NAME("data");
auth::data_resource::data_resource(level l, const sstring& ks, const sstring& cf)
: _ks(ks), _cf(cf)
: _level(l), _ks(ks), _cf(cf)
{
if (l != get_level()) {
throw std::invalid_argument("level/keyspace/column mismatch");
}
}
auth::data_resource::data_resource()
@@ -67,14 +64,7 @@ auth::data_resource::data_resource(const sstring& ks, const sstring& cf)
{}
auth::data_resource::level auth::data_resource::get_level() const {
if (!_cf.empty()) {
assert(!_ks.empty());
return level::COLUMN_FAMILY;
}
if (!_ks.empty()) {
return level::KEYSPACE;
}
return level::ROOT;
return _level;
}
auth::data_resource auth::data_resource::from_name(


@@ -56,6 +56,7 @@ private:
static const sstring ROOT_NAME;
level _level;
sstring _ks;
sstring _cf;


@@ -218,12 +218,12 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
// obsolete prepared statements pretty quickly.
// Rely on query processing caching statements instead, and let's assume
// that a map lookup string->statement is not gonna kill us much.
auto& qp = cql3::get_local_query_processor();
return qp.process(
sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), { username }, true).then_wrapped(
[=](future<::shared_ptr<cql3::untyped_result_set>> f) {
return futurize_apply([this, username, password] {
auto& qp = cql3::get_local_query_processor();
return qp.process(sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), {username}, true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
@@ -234,6 +234,8 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));
}
});
}


@@ -40,6 +40,7 @@
*/
#include <unordered_map>
#include <boost/algorithm/string.hpp>
#include "permission.hh"
const auth::permission_set auth::permissions::ALL_DATA =
@@ -75,7 +76,9 @@ const sstring& auth::permissions::to_string(permission p) {
}
auth::permission auth::permissions::from_string(const sstring& s) {
return permission_names.at(s);
sstring upper(s);
boost::to_upper(upper);
return permission_names.at(upper);
}
std::unordered_set<sstring> auth::permissions::to_strings(const permission_set& set) {


@@ -38,7 +38,7 @@ class bytes_ostream {
public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
static constexpr size_type max_chunk_size = 16 * 1024;
static constexpr size_type max_chunk_size() { return 16 * 1024; }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -59,7 +59,6 @@ private:
};
// FIXME: consider increasing chunk size as the buffer grows
static constexpr size_type chunk_size{512};
static constexpr size_type usable_chunk_size{chunk_size - sizeof(chunk)};
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
@@ -100,6 +99,19 @@ private:
}
return _current->size - _current->offset;
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be at least chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
return std::max<size_type>(next_size, data_size + sizeof(chunk));
}
// Makes room for a contiguous region of given size.
// The region is accounted for as already written.
// size must not be zero.
@@ -110,7 +122,7 @@ private:
_size += size;
return ret;
} else {
auto alloc_size = size <= usable_chunk_size ? chunk_size : (size + sizeof(chunk));
auto alloc_size = next_alloc_size(size);
auto space = malloc(alloc_size);
if (!space) {
throw std::bad_alloc();
@@ -205,7 +217,7 @@ public:
}
while (!v.empty()) {
auto this_size = std::min(v.size(), size_t(max_chunk_size));
auto this_size = std::min(v.size(), size_t(max_chunk_size()));
std::copy_n(v.begin(), this_size, alloc(this_size));
v.remove_prefix(this_size);
}
@@ -329,7 +341,7 @@ public:
// if its size is below max_chunk_size. We probably could also gain
// some read performance by doing "real" reduction, i.e. merging
// all chunks until all but the last one is max_chunk_size.
if (size() < max_chunk_size) {
if (size() < max_chunk_size()) {
linearize();
}
}


@@ -44,7 +44,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
mutation_partition_serializer part_ser(*m.schema(), m.partition());
bytes_ostream out;
ser::writer_of_canonical_mutation wr(out);
ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())


@@ -27,125 +27,125 @@
class checked_file_impl : public file_impl {
public:
checked_file_impl(disk_error_signal_type& s, file f)
: _signal(s) , _file(f) {
checked_file_impl(const io_error_handler& error_handler, file f)
: _error_handler(error_handler), _file(f) {
_memory_dma_alignment = f.memory_dma_alignment();
_disk_read_dma_alignment = f.disk_read_dma_alignment();
_disk_write_dma_alignment = f.disk_write_dma_alignment();
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, iov, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, iov, pc);
});
}
virtual future<> flush(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->flush();
});
}
virtual future<struct stat> stat(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->stat();
});
}
virtual future<> truncate(uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->truncate(length);
});
}
virtual future<> discard(uint64_t offset, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->discard(offset, length);
});
}
virtual future<> allocate(uint64_t position, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->allocate(position, length);
});
}
virtual future<uint64_t> size(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->size();
});
}
virtual future<> close() override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->close();
});
}
virtual subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->list_directory(next);
});
}
private:
disk_error_signal_type &_signal;
const io_error_handler& _error_handler;
file _file;
};
inline file make_checked_file(disk_error_signal_type& signal, file& f)
inline file make_checked_file(const io_error_handler& error_handler, file& f)
{
return file(::make_shared<checked_file_impl>(signal, f));
return file(::make_shared<checked_file_impl>(error_handler, f));
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags,
file_open_options options)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags, options).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_directory(disk_error_signal_type& signal,
inline open_checked_directory(const io_error_handler& error_handler,
sstring name)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return engine().open_directory(name).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}


@@ -54,6 +54,10 @@ public:
// Return a list of sstables to be compacted after applying the strategy.
compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<lw_shared_ptr<sstable>> candidates);
// Some strategies may look at the compacted and resulting sstables to
// get some useful information for subsequent compactions.
void notify_completion(const std::vector<lw_shared_ptr<sstable>>& removed, const std::vector<lw_shared_ptr<sstable>>& added);
// Return if parallel compaction is allowed by strategy.
bool parallel_compaction() const;


@@ -39,6 +39,9 @@ public:
compatible_ring_position(const schema& s, dht::ring_position&& rp)
: _schema(&s), _rp(std::move(rp)) {
}
const dht::token& token() const {
return _rp->token();
}
friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return x._rp->tri_compare(*x._schema, *y._rp);
}


@@ -39,17 +39,17 @@ public:
static constexpr auto CHUNK_LENGTH_KB = "chunk_length_kb";
static constexpr auto CRC_CHECK_CHANCE = "crc_check_chance";
private:
compressor _compressor = compressor::none;
compressor _compressor;
std::experimental::optional<int> _chunk_length;
std::experimental::optional<double> _crc_check_chance;
public:
compression_parameters() = default;
compression_parameters(compressor c) : _compressor(c) { }
compression_parameters(compressor c = compressor::lz4) : _compressor(c) { }
compression_parameters(const std::map<sstring, sstring>& options) {
validate_options(options);
auto it = options.find(SSTABLE_COMPRESSION);
if (it == options.end() || it->second.empty()) {
_compressor = compressor::none;
return;
}
const auto& compressor_class = it->second;


@@ -409,29 +409,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# the smaller of 1/4 of heap or 512MB.
# file_cache_size_in_mb: 512
# Total permitted memory to use for memtables. Scylla will stop
# accepting writes when the limit is exceeded until a flush completes,
# and will trigger a flush based on memtable_cleanup_threshold
# If omitted, Scylla will set both to 1/4 the size of the heap.
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
# Ratio of occupied non-flushing memtable size to total permitted size
# that will trigger a flush of the largest memtable. Lager mct will
# mean larger flushes and hence less compaction, but also less concurrent
# flush activity which can make it difficult to keep your disks fed
# under heavy write load.
#
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.11
# Specify the way Scylla allocates and manages memtable memory.
# Options are:
# heap_buffers: on heap nio buffers
# offheap_buffers: off heap (direct) nio buffers
# offheap_objects: native memory, eliminating nio buffer heap overhead
# memtable_allocation_type: heap_buffers
# Total space to use for commitlogs.
#
# If space gets above this value (it will round up to the next nearest
@@ -443,17 +420,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# available for Scylla.
commitlog_total_space_in_mb: -1
# This sets the amount of memtable flush writer threads. These will
# be blocked by disk io, and each one will hold a memtable in memory
# while blocked.
#
# memtable_flush_writers defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
#memtable_flush_writers: 8
# A fixed memory pool size in MB for for SSTable index summaries. If left
# empty, this will default to 5% of the heap size. If the memory usage of
# all index summaries exceeds this limit, SSTables with low read rates will


@@ -108,6 +108,11 @@ def debug_flag(compiler):
print('Note: debug information disabled; upgrade your compiler')
return ''
def maybe_static(flag, libs):
if flag and not args.static:
libs = '-Wl,-Bstatic {} -Wl,-Bdynamic'.format(libs)
return libs
class Thrift(object):
def __init__(self, source, service):
self.source = source
@@ -184,7 +189,6 @@ scylla_tests = [
'tests/storage_proxy_test',
'tests/schema_change_test',
'tests/mutation_reader_test',
'tests/key_reader_test',
'tests/mutation_query_test',
'tests/row_cache_test',
'tests/test-serialization',
@@ -222,6 +226,8 @@ scylla_tests = [
'tests/database_test',
'tests/nonwrapping_range_test',
'tests/input_stream_test',
'tests/sstable_atomic_deletion_test',
'tests/virtual_reader_test',
]
apps = [
@@ -263,7 +269,9 @@ arg_parser.add_argument('--debuginfo', action = 'store', dest = 'debuginfo', typ
arg_parser.add_argument('--static-stdc++', dest = 'staticcxx', action = 'store_true',
help = 'Link libgcc and libstdc++ statically')
arg_parser.add_argument('--static-thrift', dest = 'staticthrift', action = 'store_true',
help = 'Link libthrift statically')
arg_parser.add_argument('--static-boost', dest = 'staticboost', action = 'store_true',
help = 'Link boost statically')
arg_parser.add_argument('--tests-debuginfo', action = 'store', dest = 'tests_debuginfo', type = int, default = 0,
help = 'Enable(1)/disable(0) compiler debug information generation for tests')
arg_parser.add_argument('--python', action = 'store', dest = 'python', default = 'python3',
@@ -299,7 +307,6 @@ scylla_core = (['database.cc',
'mutation_partition_serializer.cc',
'mutation_reader.cc',
'mutation_query.cc',
'key_reader.cc',
'keys.cc',
'sstables/sstables.cc',
'sstables/compress.cc',
@@ -309,6 +316,7 @@ scylla_core = (['database.cc',
'sstables/compaction.cc',
'sstables/compaction_strategy.cc',
'sstables/compaction_manager.cc',
'sstables/atomic_deletion.cc',
'transport/event.cc',
'transport/event_notifier.cc',
'transport/server.cc',
@@ -328,6 +336,7 @@ scylla_core = (['database.cc',
'cql3/statements/authentication_statement.cc',
'cql3/statements/create_keyspace_statement.cc',
'cql3/statements/create_table_statement.cc',
'cql3/statements/create_view_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/create_user_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
@@ -485,7 +494,7 @@ scylla_core = (['database.cc',
'tracing/trace_state.cc',
'range_tombstone.cc',
'range_tombstone_list.cc',
'db/size_estimates_recorder.cc',
'disk-error-handler.cc'
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -564,43 +573,49 @@ deps = {
'scylla': idls + ['main.cc'] + scylla_core + api,
}
tests_not_using_seastar_test_framework = set([
'tests/keys_test',
pure_boost_tests = set([
'tests/partitioner_test',
'tests/map_difference_test',
'tests/keys_test',
'tests/compound_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/nonwrapping_range_test',
'tests/test-serialization',
'tests/range_test',
'tests/crc_test',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/cartesian_product_test',
])
tests_not_using_seastar_test_framework = set([
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
'tests/lsa_sync_eviction_test',
'tests/row_cache_alloc_stress',
'tests/perf_row_cache_update',
'tests/cartesian_product_test',
'tests/perf/perf_hash',
'tests/perf/perf_cql_parser',
'tests/message',
'tests/perf/perf_simple_query',
'tests/memory_footprint',
'tests/test-serialization',
'tests/gossip',
'tests/compound_test',
'tests/range_test',
'tests/crc_test',
'tests/perf/perf_sstable',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/nonwrapping_range_test',
])
]) | pure_boost_tests
for t in tests_not_using_seastar_test_framework:
if not t in scylla_tests:
raise Exception("Test %s not found in scylla_tests" % (t))
for t in scylla_tests:
deps[t] = scylla_tests_dependencies + [t + '.cc']
deps[t] = [t + '.cc']
if t not in tests_not_using_seastar_test_framework:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc']
@@ -704,6 +719,8 @@ elif args.dpdk_target:
seastar_flags += ['--dpdk-target', args.dpdk_target]
if args.staticcxx:
seastar_flags += ['--static-stdc++']
if args.staticboost:
seastar_flags += ['--static-boost']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--cflags=%s' % (seastar_cflags)]
@@ -737,7 +754,14 @@ for mode in build_modes:
seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_restat.'
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = "-lyaml-cpp -llz4 -lz -lsnappy " + pkg_config("--libs", "jsoncpp") + ' -lboost_filesystem' + ' -lcrypt' + ' -lboost_date_time'
libs = ' '.join(['-lyaml-cpp', '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt',
maybe_static(args.staticboost, '-lboost_date_time'),
])
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
for pkg in pkgs:
args.user_cflags += ' ' + pkg_config('--cflags', pkg)
libs += ' ' + pkg_config('--libs', pkg)
@@ -851,6 +875,11 @@ with open(buildfile, 'w') as f:
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
else:
if binary.startswith('tests/'):
local_libs = '$libs'
if binary not in tests_not_using_seastar_test_framework or binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
# by many tests yields ridiculous amounts of disk space.
# So we strip the tests by default; The user can very
@@ -858,15 +887,15 @@ with open(buildfile, 'w') as f:
# to the test name, e.g., "ninja build/release/testname_g"
f.write('build $builddir/{}/{}: {}.{} {} {}\n'.format(mode, binary, tests_link_rule, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
f.write(' libs = {}\n'.format(local_libs))
else:
f.write('build $builddir/{}/{}: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
if has_thrift:
f.write(' libs = {} {} $libs\n'.format(thrift_libs, maybe_static(args.staticboost, '-lboost_system')))
for src in srcs:
if src.endswith('.cc'):
obj = '$builddir/' + mode + '/' + src.replace('.cc', '.o')


@@ -40,6 +40,7 @@ options {
#include "cql3/statements/drop_keyspace_statement.hh"
#include "cql3/statements/create_index_statement.hh"
#include "cql3/statements/create_table_statement.hh"
#include "cql3/statements/create_view_statement.hh"
#include "cql3/statements/create_type_statement.hh"
#include "cql3/statements/drop_type_statement.hh"
#include "cql3/statements/alter_type_statement.hh"
@@ -340,6 +341,7 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st30=createAggregateStatement { $stmt = st30; }
| st31=dropAggregateStatement { $stmt = st31; }
#endif
| st32=createViewStatement { $stmt = st32; }
;
/*
@@ -716,7 +718,7 @@ createTableStatement returns [shared_ptr<cql3::statements::create_table_statemen
cfamDefinition[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: '(' cfamColumns[expr] ( ',' cfamColumns[expr]? )* ')'
( K_WITH cfamProperty[expr] ( K_AND cfamProperty[expr] )*)?
( K_WITH cfamProperty[$expr->properties()] ( K_AND cfamProperty[$expr->properties()] )*)?
;
cfamColumns[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
@@ -732,15 +734,15 @@ pkDef[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
| '(' k1=ident { l.push_back(k1); } ( ',' kn=ident { l.push_back(kn); } )* ')' { $expr->add_key_aliases(l); }
;
cfamProperty[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: property[expr->properties]
| K_COMPACT K_STORAGE { $expr->set_compact_storage(); }
cfamProperty[cql3::statements::cf_properties& expr]
: property[$expr.properties()]
| K_COMPACT K_STORAGE { $expr.set_compact_storage(); }
| K_CLUSTERING K_ORDER K_BY '(' cfamOrdering[expr] (',' cfamOrdering[expr])* ')'
;
cfamOrdering[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
cfamOrdering[cql3::statements::cf_properties& expr]
@init{ bool reversed=false; }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr->set_ordering(k, reversed); }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr.set_ordering(k, reversed); }
;
@@ -787,6 +789,39 @@ indexIdent returns [::shared_ptr<index_target::raw> id]
| K_FULL '(' c=cident ')' { $id = index_target::raw::full_collection(c); }
;
/**
* CREATE MATERIALIZED VIEW <viewName> AS
* SELECT <columns>
* FROM <CF>
* WHERE <pkColumns> IS NOT NULL
* PRIMARY KEY (<pkColumns>)
* WITH <property> = <value> AND ...;
*/
createViewStatement returns [::shared_ptr<create_view_statement> expr]
@init {
bool if_not_exists = false;
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> composite_keys;
}
: K_CREATE K_MATERIALIZED K_VIEW (K_IF K_NOT K_EXISTS { if_not_exists = true; })? cf=columnFamilyName K_AS
K_SELECT sclause=selectClause K_FROM basecf=columnFamilyName
(K_WHERE wclause=whereClause)?
K_PRIMARY K_KEY (
'(' '(' k1=cident { partition_keys.push_back(k1); } ( ',' kn=cident { partition_keys.push_back(kn); } )* ')' ( ',' c1=cident { composite_keys.push_back(c1); } )* ')'
| '(' k1=cident { partition_keys.push_back(k1); } ( ',' cn=cident { composite_keys.push_back(cn); } )* ')'
)
{
$expr = ::make_shared<create_view_statement>(
std::move(cf),
std::move(basecf),
std::move(sclause),
std::move(wclause),
std::move(partition_keys),
std::move(composite_keys),
if_not_exists);
}
( K_WITH cfamProperty[{ $expr->properties() }] ( K_AND cfamProperty[{ $expr->properties() }] )*)?
;
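A concrete statement accepted by the rule above (keyspace, table, view, and column names are illustrative):

```sql
CREATE MATERIALIZED VIEW ks.users_by_email AS
    SELECT * FROM ks.users
    WHERE email IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (email, id);
```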
#if 0
/**
@@ -1304,7 +1339,8 @@ relation[std::vector<cql3::relation_ptr>& clauses]
| K_TOKEN l=tupleOfIdentifiers type=relationType t=term
{ $clauses.emplace_back(::make_shared<cql3::token_relation>(std::move(l), *type, std::move(t))); }
| name=cident K_IS K_NOT K_NULL {
$clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IS_NOT, cql3::constants::NULL_LITERAL)); }
| name=cident K_IN marker=inMarker
{ $clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IN, std::move(marker))); }
| name=cident K_IN in_values=singleColumnInValues
@@ -1528,6 +1564,8 @@ K_KEYSPACE: ( K E Y S P A C E
K_KEYSPACES: K E Y S P A C E S;
K_COLUMNFAMILY:( C O L U M N F A M I L Y
| T A B L E );
K_MATERIALIZED:M A T E R I A L I Z E D;
K_VIEW: V I E W;
K_INDEX: I N D E X;
K_CUSTOM: C U S T O M;
K_ON: O N;
@@ -1551,6 +1589,7 @@ K_DESC: D E S C;
K_ALLOW: A L L O W;
K_FILTERING: F I L T E R I N G;
K_IF: I F;
K_IS: I S;
K_CONTAINS: C O N T A I N S;
K_GRANT: G R A N T;


@@ -52,5 +52,6 @@ const operator_type operator_type::IN(7, operator_type::IN, "IN");
const operator_type operator_type::CONTAINS(5, operator_type::CONTAINS, "CONTAINS");
const operator_type operator_type::CONTAINS_KEY(6, operator_type::CONTAINS_KEY, "CONTAINS_KEY");
const operator_type operator_type::NEQ(8, operator_type::NEQ, "!=");
const operator_type operator_type::IS_NOT(9, operator_type::IS_NOT, "IS NOT");
}


@@ -58,6 +58,7 @@ public:
static const operator_type CONTAINS;
static const operator_type CONTAINS_KEY;
static const operator_type NEQ;
static const operator_type IS_NOT;
private:
int32_t _b;
const operator_type& _reverse;


@@ -92,13 +92,33 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy,
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _collectd_regs{
scollectd::add_polled_metric(scollectd::type_instance_id("query_processor"
, scollectd::per_cpu_plugin_instance
, "total_operations", "statements_prepared")
, scollectd::make_typed(scollectd::data_type::DERIVE, _stats.prepare_invocations)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "reads")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.reads)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "inserts")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.inserts)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "updates")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.updates)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "deletes")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.deletes)),
scollectd::add_polled_metric(scollectd::type_instance_id("cql"
, scollectd::per_cpu_plugin_instance
, "total_operations", "batches")
, scollectd::make_typed(scollectd::data_type::DERIVE, _cql_stats.batches))}
, _internal_state(new internal_state())
{
_collectd_regs.push_back(
scollectd::add_polled_metric(scollectd::type_instance_id("query_processor"
, scollectd::per_cpu_plugin_instance
, "total_operations", "statements_prepared")
, scollectd::make_typed(scollectd::data_type::DERIVE, _stats.prepare_invocations)));
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
}
@@ -285,7 +305,7 @@ query_processor::get_statement(const sstring_view& query, const service::client_
Tracing.trace("Preparing statement");
#endif
++_stats.prepare_invocations;
return statement->prepare(_db.local());
return statement->prepare(_db.local(), _cql_stats);
}
::shared_ptr<raw::parsed_statement>
@@ -346,7 +366,7 @@ query_options query_processor::make_internal_options(::shared_ptr<statements::pr
{
auto& p = _internal_statements[query_string];
if (p == nullptr) {
auto np = parse_statement(query_string)->prepare(_db.local());
auto np = parse_statement(query_string)->prepare(_db.local(), _cql_stats);
np->statement->validate(_proxy, *_internal_state);
p = std::move(np); // inserts it into map
}
@@ -382,7 +402,7 @@ query_processor::process(const sstring& query_string,
const std::initializer_list<data_value>& values,
bool cache)
{
auto p = cache ? prepare_internal(query_string) : parse_statement(query_string)->prepare(_db.local());
auto p = cache ? prepare_internal(query_string) : parse_statement(query_string)->prepare(_db.local(), _cql_stats);
if (!cache) {
p->statement->validate(_proxy, *_internal_state);
}


@@ -75,7 +75,9 @@ private:
uint64_t prepare_invocations = 0;
} _stats;
std::vector<scollectd::registration> _collectd_regs;
cql_stats _cql_stats;
scollectd::registrations _collectd_regs;
class internal_state;
std::unique_ptr<internal_state> _internal_state;
@@ -92,6 +94,11 @@ public:
distributed<service::storage_proxy>& proxy() {
return _proxy;
}
cql_stats& get_cql_stats() {
return _cql_stats;
}
#if 0
public static final QueryProcessor instance = new QueryProcessor();
#endif


@@ -156,6 +156,10 @@ public:
return new_contains_restriction(db, schema, bound_names, false);
} else if (_relation_type == operator_type::CONTAINS_KEY) {
return new_contains_restriction(db, schema, bound_names, true);
} else if (_relation_type == operator_type::IS_NOT) {
// This case is not supposed to happen: statement_restrictions
// constructor does not call this function for views' IS_NOT.
throw exceptions::invalid_request_exception(sprint("Unsupported \"IS NOT\" relation: %s", to_string()));
} else {
throw exceptions::invalid_request_exception(sprint("Unsupported \"!=\" relation: %s", to_string()));
}


@@ -28,6 +28,9 @@
#include "single_column_primary_key_restrictions.hh"
#include "token_restriction.hh"
#include "cql3/single_column_relation.hh"
#include "cql3/constants.hh"
namespace cql3 {
namespace restrictions {
@@ -131,13 +134,21 @@ statement_restrictions::statement_restrictions(schema_ptr schema)
, _clustering_columns_restrictions(get_initial_key_restrictions<clustering_key_prefix>())
, _nonprimary_key_restrictions(::make_shared<single_column_restrictions>(schema))
{ }
#if 0
static const column_definition*
to_column_definition(const schema_ptr& schema, const ::shared_ptr<column_identifier::raw>& entity) {
return get_column_definition(schema,
*entity->prepare_column_identifier(schema));
}
#endif
statement_restrictions::statement_restrictions(database& db,
schema_ptr schema,
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection)
bool select_a_collection,
bool for_view)
: statement_restrictions(schema)
{
/*
@@ -149,7 +160,31 @@ statement_restrictions::statement_restrictions(database& db,
*/
if (!where_clause.empty()) {
for (auto&& relation : where_clause) {
add_restriction(relation->to_restriction(db, schema, bound_names));
if (relation->get_operator() == cql3::operator_type::IS_NOT) {
single_column_relation* r =
dynamic_cast<single_column_relation*>(relation.get());
// The "IS NOT NULL" restriction is only supported (and
// mandatory) for materialized view creation:
if (!r) {
throw exceptions::invalid_request_exception("IS NOT only supports single column");
}
// currently, the grammar only allows the NULL argument to be
// "IS NOT", so this assertion should not be able to fail
assert(r->get_value() == cql3::constants::NULL_LITERAL);
auto col_id = r->get_entity()->prepare_column_identifier(schema);
const auto *cd = get_column_definition(schema, *col_id);
if (!cd) {
throw exceptions::invalid_request_exception(sprint("restriction '%s' unknown column %s", relation->to_string(), r->get_entity()->to_string()));
}
_not_null_columns.insert(cd);
if (!for_view) {
throw exceptions::invalid_request_exception(sprint("restriction '%s' is only supported in materialized view creation", relation->to_string()));
}
} else {
add_restriction(relation->to_restriction(db, schema, bound_names));
}
}
}


@@ -83,6 +83,8 @@ private:
*/
::shared_ptr<single_column_restrictions> _nonprimary_key_restrictions;
std::unordered_set<const column_definition*> _not_null_columns;
/**
* The restrictions used to build the index expressions
*/
@@ -112,7 +114,8 @@ public:
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection);
bool select_a_collection,
bool for_view = false);
private:
void add_restriction(::shared_ptr<restriction> restriction);
void add_single_column_restriction(::shared_ptr<single_column_restriction> restriction);


@@ -97,7 +97,14 @@ public:
if (!buf) {
throw exceptions::invalid_request_exception("Invalid null token value");
}
return dht::token(dht::token::kind::key, *buf);
auto tk = dht::global_partitioner().from_bytes(*buf);
if (tk.is_minimum() && !is_start(b)) {
// The token was parsed as a minimum marker (token::kind::before_all_keys), but
// as it appears in the end bound position, it is actually the maximum marker
// (token::kind::after_all_keys).
return dht::maximum_token();
}
return tk;
};
const auto start_token = get_token_bound(statements::bound::START);


@@ -232,7 +232,7 @@ uint32_t selection::add_column_for_ordering(const column_definition& c) {
raw_selector::to_selectables(raw_selectors, schema), db, schema, defs);
auto metadata = collect_metadata(schema, raw_selectors, *factories);
if (processes_selection(raw_selectors)) {
if (processes_selection(raw_selectors) || raw_selectors.size() != defs.size()) {
return ::make_shared<selection_with_processing>(schema, std::move(defs), std::move(metadata), std::move(factories));
} else {
return ::make_shared<simple_selection>(schema, std::move(defs), std::move(metadata), false);


@@ -110,6 +110,11 @@ public:
::shared_ptr<term::raw> get_map_key() {
return _map_key;
}
::shared_ptr<term::raw> get_value() {
return _value;
}
protected:
virtual ::shared_ptr<term> to_term(const std::vector<::shared_ptr<column_specification>>& receivers,
::shared_ptr<term::raw> raw, database& db, const sstring& keyspace,


@@ -103,7 +103,7 @@ shared_ptr<transport::event::schema_change> cql3::statements::alter_keyspace_sta
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_keyspace_statement::prepare(database& db) {
cql3::statements::alter_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_keyspace_statement>(*this));
}


@@ -63,7 +63,7 @@ public:
void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -274,7 +274,7 @@ shared_ptr<transport::event::schema_change> alter_table_statement::change_event(
}
shared_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_table_statement::prepare(database& db) {
cql3::statements::alter_table_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_table_statement>(*this));
}


@@ -80,7 +80,7 @@ public:
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -43,6 +43,7 @@
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "boost/range/adaptor/map.hpp"
#include "stdx.hh"
namespace cql3 {
@@ -86,14 +87,14 @@ const sstring& alter_type_statement::keyspace() const
return _name.get_keyspace();
}
static int32_t get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
static stdx::optional<uint32_t> get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
{
for (uint32_t i = 0; i < type->field_names().size(); ++i) {
if (field->name() == type->field_names()[i]) {
return i;
return {i};
}
}
return -1;
return {};
}
void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, bool is_local_only)
@@ -164,7 +165,7 @@ alter_type_statement::add_or_alter::add_or_alter(const ut_name& name, bool is_ad
user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_update) const
{
if (get_idx_of_field(to_update, _field_name) >= 0) {
if (get_idx_of_field(to_update, _field_name)) {
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s to type %s: a field of the same name already exists", _field_name->name(), _name.to_string()));
}
@@ -181,19 +182,19 @@ user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_
user_type alter_type_statement::add_or_alter::do_alter(database& db, user_type to_update) const
{
uint32_t idx = get_idx_of_field(to_update, _field_name);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, _field_name);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", _field_name->name(), _name.to_string()));
}
auto previous = to_update->field_types()[idx];
auto previous = to_update->field_types()[*idx];
auto new_type = _field_type->prepare(db, keyspace())->get_type();
if (!new_type->is_compatible_with(*previous)) {
throw exceptions::invalid_request_exception(sprint("Type %s is incompatible with previous type %s of field %s in user type %s", _field_type->to_string(), previous->as_cql3_type()->to_string(), _field_name->name(), _name.to_string()));
}
std::vector<data_type> new_types(to_update->field_types());
new_types[idx] = new_type;
new_types[*idx] = new_type;
return user_type_impl::get_instance(to_update->_keyspace, to_update->_name, to_update->field_names(), std::move(new_types));
}
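The switch to `stdx::optional` above also closes a latent bug visible in the old `do_alter`: the index was stored in a `uint32_t`, so the `idx < 0` sentinel check could never fire. In sketch form, with Python's `None` playing the role of an empty optional:

```python
from typing import Optional, Sequence

def idx_of_field(field_names: Sequence[str], field: str) -> Optional[int]:
    # Return the field's index, or None rather than a -1 sentinel
    # that an unsigned comparison would silently swallow.
    for i, name in enumerate(field_names):
        if name == field:
            return i
    return None
```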
@@ -217,11 +218,11 @@ user_type alter_type_statement::renames::make_updated_type(database& db, user_ty
std::vector<bytes> new_names(to_update->field_names());
for (auto&& rename : _renames) {
auto&& from = rename.first;
int32_t idx = get_idx_of_field(to_update, from);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, from);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", from->to_string(), _name.to_string()));
}
new_names[idx] = rename.second->name();
new_names[*idx] = rename.second->name();
}
auto&& updated = user_type_impl::get_instance(to_update->_keyspace, to_update->_name, std::move(new_names), to_update->field_types());
create_type_statement::check_for_duplicate_names(updated);
@@ -229,12 +230,12 @@ user_type alter_type_statement::renames::make_updated_type(database& db, user_ty
}
shared_ptr<cql3::statements::prepared_statement>
alter_type_statement::add_or_alter::prepare(database& db) {
alter_type_statement::add_or_alter::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::add_or_alter>(*this));
}
shared_ptr<cql3::statements::prepared_statement>
alter_type_statement::renames::prepare(database& db) {
alter_type_statement::renames::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::renames>(*this));
}


@@ -84,7 +84,7 @@ public:
const shared_ptr<column_identifier> field_name,
const shared_ptr<cql3_type::raw> field_type);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
private:
user_type do_add(database& db, user_type to_update) const;
user_type do_alter(database& db, user_type to_update) const;
@@ -101,7 +101,7 @@ public:
void add_rename(shared_ptr<column_identifier> previous_name, shared_ptr<column_identifier> new_name);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -47,7 +47,7 @@ uint32_t cql3::statements::authentication_statement::get_bound_terms() {
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authentication_statement::prepare(
database& db) {
database& db, cql_stats& stats) {
return ::make_shared<prepared>(this->shared_from_this());
}


@@ -54,7 +54,7 @@ class authentication_statement : public raw::parsed_statement, public cql_statem
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;


@@ -47,7 +47,7 @@ uint32_t cql3::statements::authorization_statement::get_bound_terms() {
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authorization_statement::prepare(
database& db) {
database& db, cql_stats& stats) {
return ::make_shared<parsed_statement::prepared>(this->shared_from_this());
}


@@ -58,7 +58,7 @@ class authorization_statement : public raw::parsed_statement, public cql_stateme
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;


@@ -102,18 +102,18 @@ void batch_statement::verify_batch_size(const std::vector<mutation>& mutations)
namespace raw {
shared_ptr<prepared_statement>
batch_statement::prepare(database& db) {
batch_statement::prepare(database& db, cql_stats& stats) {
auto&& bound_names = get_bound_variables();
std::vector<shared_ptr<cql3::statements::modification_statement>> statements;
for (auto&& parsed : _parsed_statements) {
statements.push_back(parsed->prepare(db, bound_names));
statements.push_back(parsed->prepare(db, bound_names, stats));
}
auto&& prep_attrs = _attrs->prepare(db, "[batch]", "[batch]");
prep_attrs->collect_marker_specification(bound_names);
cql3::statements::batch_statement batch_statement_(bound_names->size(), _type, std::move(statements), std::move(prep_attrs));
cql3::statements::batch_statement batch_statement_(bound_names->size(), _type, std::move(statements), std::move(prep_attrs), stats);
batch_statement_.validate();
return ::make_shared<prepared>(make_shared(std::move(batch_statement_)),


@@ -74,6 +74,7 @@ private:
std::vector<shared_ptr<modification_statement>> _statements;
std::unique_ptr<attributes> _attrs;
bool _has_conditions;
cql_stats& _stats;
public:
/**
* Creates a new BatchStatement from a list of statements and a
@@ -85,10 +86,12 @@ public:
*/
batch_statement(int bound_terms, type type_,
std::vector<shared_ptr<modification_statement>> statements,
std::unique_ptr<attributes> attrs)
std::unique_ptr<attributes> attrs,
cql_stats& stats)
: _bound_terms(bound_terms), _type(type_), _statements(std::move(statements))
, _attrs(std::move(attrs))
, _has_conditions(boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::has_conditions))) {
, _has_conditions(boost::algorithm::any_of(_statements, std::mem_fn(&modification_statement::has_conditions)))
, _stats(stats) {
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
@@ -175,6 +178,7 @@ private:
boost::make_counting_iterator<size_t>(_statements.size()),
[this, &storage, &options, now, local, &result, trace_state] (size_t i) {
auto&& statement = _statements[i];
statement->inc_cql_stats();
auto&& statement_options = options.for_statement(i);
auto timestamp = _attrs->get_timestamp(now, statement_options);
return statement->get_mutations(storage, statement_options, local, timestamp, trace_state).then([&result] (auto&& more) {
@@ -195,6 +199,7 @@ public:
virtual future<shared_ptr<transport::messages::result_message>> execute(
distributed<service::storage_proxy>& storage, service::query_state& state, const query_options& options) override {
++_stats.batches;
return execute(storage, state, options, false, options.get_timestamp(state));
}
private:


@@ -100,12 +100,12 @@ void cf_prop_defs::validate() {
}
auto compression_options = get_compression_options();
if (!compression_options.empty()) {
auto sstable_compression_class = compression_options.find(sstring(compression_parameters::SSTABLE_COMPRESSION));
if (sstable_compression_class == compression_options.end()) {
if (compression_options && !compression_options->empty()) {
auto sstable_compression_class = compression_options->find(sstring(compression_parameters::SSTABLE_COMPRESSION));
if (sstable_compression_class == compression_options->end()) {
throw exceptions::configuration_exception(sstring("Missing sub-option '") + compression_parameters::SSTABLE_COMPRESSION + "' for the '" + KW_COMPRESSION + "' option.");
}
compression_parameters cp(compression_options);
compression_parameters cp(*compression_options);
cp.validate();
}
@@ -131,12 +131,12 @@ std::map<sstring, sstring> cf_prop_defs::get_compaction_options() const {
return std::map<sstring, sstring>{};
}
std::map<sstring, sstring> cf_prop_defs::get_compression_options() const {
stdx::optional<std::map<sstring, sstring>> cf_prop_defs::get_compression_options() const {
auto compression_options = get_map(KW_COMPRESSION);
if (compression_options) {
return compression_options.value();
return { compression_options.value() };
}
return std::map<sstring, sstring>{};
return { };
}
int32_t cf_prop_defs::get_default_time_to_live() const
@@ -206,8 +206,9 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder) {
     }
     builder.set_bloom_filter_fp_chance(get_double(KW_BF_FP_CHANCE, builder.get_bloom_filter_fp_chance()));
-    if (!get_compression_options().empty()) {
-        builder.set_compressor_params(compression_parameters(get_compression_options()));
+    auto compression_options = get_compression_options();
+    if (compression_options) {
+        builder.set_compressor_params(compression_parameters(*compression_options));
     }
 #if 0
     CachingOptions cachingOptions = getCachingOptions();
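The optional-returning refactor above distinguishes "compression never specified" from "compression specified with an empty map", which the old empty-map sentinel conflated. A minimal standalone sketch of the same pattern, using `std::optional` and hypothetical names in place of Scylla's `stdx::optional` and `cf_prop_defs`:

```cpp
#include <map>
#include <optional>
#include <string>

using options_map = std::map<std::string, std::string>;

// Hypothetical stand-in for cf_prop_defs: holds the raw WITH-clause options.
struct prop_defs {
    std::optional<options_map> compression; // nullopt == option never given

    // Mirrors the new get_compression_options(): absence is representable.
    std::optional<options_map> get_compression_options() const {
        return compression;
    }
};

// Mirrors the apply_to_builder() change: only touch the compressor when the
// user actually supplied a 'compression' option, even an empty one.
bool should_set_compressor(const prop_defs& p) {
    return static_cast<bool>(p.get_compression_options());
}
```

With the old signature, `WITH compression = {}` and the absence of any `compression` clause both came back as an empty map, so the builder could not tell whether to override the default compressor.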


@@ -82,7 +82,7 @@ private:
 public:
     void validate();
     std::map<sstring, sstring> get_compaction_options() const;
-    std::map<sstring, sstring> get_compression_options() const;
+    stdx::optional<std::map<sstring, sstring>> get_compression_options() const;
 #if 0
     public CachingOptions getCachingOptions() throws SyntaxException, ConfigurationException
     {


@@ -0,0 +1,97 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once

#include "cql3/statements/cf_prop_defs.hh"

namespace cql3 {

namespace statements {

/**
 * Class for common statement properties.
 */
class cf_properties final {
    const ::shared_ptr<cf_prop_defs> _properties = ::make_shared<cf_prop_defs>();
    bool _use_compact_storage = false;
    std::vector<std::pair<::shared_ptr<column_identifier>, bool>> _defined_ordering; // Insertion ordering is important
public:
    auto& properties() const {
        return _properties;
    }

    bool use_compact_storage() const {
        return _use_compact_storage;
    }

    void set_compact_storage() {
        _use_compact_storage = true;
    }

    auto& defined_ordering() const {
        return _defined_ordering;
    }

    data_type get_reversable_type(::shared_ptr<column_identifier> t, data_type type) const {
        auto is_reversed = find_ordering_info(t);
        return is_reversed && *is_reversed ? reversed_type_impl::get_instance(type) : type;
    }

    std::experimental::optional<bool> find_ordering_info(::shared_ptr<column_identifier> type) const {
        for (auto& t: _defined_ordering) {
            if (*(t.first) == *type) {
                return t.second;
            }
        }
        return {};
    }

    void set_ordering(::shared_ptr<column_identifier> alias, bool reversed) {
        _defined_ordering.emplace_back(alias, reversed);
    }

    void validate() {
        _properties->validate();
    }
};

}

}
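The new `cf_properties` helper centralizes the COMPACT STORAGE flag and the CLUSTERING ORDER bookkeeping that were previously duplicated inside `create_table_statement::raw_statement`. A small standalone sketch of the `find_ordering_info()` contract, with hypothetical plain-string names standing in for `column_identifier`:

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Insertion order is preserved: CLUSTERING ORDER entries must appear in the
// same order as the clustering key, so a vector is used instead of a map.
struct ordering_props {
    std::vector<std::pair<std::string, bool>> defined; // (column, reversed?)

    void set_ordering(std::string col, bool reversed) {
        defined.emplace_back(std::move(col), reversed);
    }

    // Three-way answer: not mentioned (nullopt), ASC (false), DESC (true).
    std::optional<bool> find_ordering_info(const std::string& col) const {
        for (const auto& [name, reversed] : defined) {
            if (name == col) {
                return reversed;
            }
        }
        return std::nullopt;
    }
};
```

`get_reversable_type()` builds on this three-way answer: only a column that is both mentioned and marked DESC gets wrapped in the reversed type.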


@@ -207,7 +207,7 @@ cql3::statements::create_index_statement::announce_migration(distributed<service
 }
 shared_ptr<cql3::statements::prepared_statement>
-cql3::statements::create_index_statement::prepare(database& db) {
+cql3::statements::create_index_statement::prepare(database& db, cql_stats& stats) {
     return make_shared<prepared_statement>(make_shared<create_index_statement>(*this));
 }


@@ -87,7 +87,7 @@ public:
             transport::event::schema_change::target_type::TABLE, keyspace(),
             column_family());
     }
-    virtual shared_ptr<prepared> prepare(database& db) override;
+    virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
 };
 }


@@ -124,7 +124,7 @@ shared_ptr<transport::event::schema_change> create_keyspace_statement::change_ev
 }
 shared_ptr<cql3::statements::prepared_statement>
-cql3::statements::create_keyspace_statement::prepare(database& db) {
+cql3::statements::create_keyspace_statement::prepare(database& db, cql_stats& stats) {
     return make_shared<prepared_statement>(make_shared<create_keyspace_statement>(*this));
 }


@@ -84,7 +84,7 @@ public:
     virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
     virtual shared_ptr<transport::event::schema_change> change_event() override;
-    virtual shared_ptr<prepared> prepare(database& db) override;
+    virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
 };
 }


@@ -64,12 +64,6 @@ create_table_statement::create_table_statement(::shared_ptr<cf_name> name,
     , _properties{properties}
     , _if_not_exists{if_not_exists}
 {
-    if (!properties->has_property(cf_prop_defs::KW_COMPRESSION) && schema::DEFAULT_COMPRESSOR) {
-        std::map<sstring, sstring> compression = {
-            { sstring(compression_parameters::SSTABLE_COMPRESSION), schema::DEFAULT_COMPRESSOR.value() },
-        };
-        properties->add_property(cf_prop_defs::KW_COMPRESSION, compression);
-    }
 }
 future<> create_table_statement::check_access(const service::client_state& state) {
@@ -157,7 +151,7 @@ void create_table_statement::add_column_metadata_from_aliases(schema_builder& bu
 }
 shared_ptr<prepared_statement>
-create_table_statement::prepare(database& db) {
+create_table_statement::prepare(database& db, cql_stats& stats) {
     // Cannot happen; create_table_statement is never instantiated as a raw statement
     // (instead we instantiate create_table_statement::raw_statement)
     abort();
@@ -169,7 +163,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
     , _if_not_exists{if_not_exists}
 { }
-::shared_ptr<prepared_statement> create_table_statement::raw_statement::prepare(database& db) {
+::shared_ptr<prepared_statement> create_table_statement::raw_statement::prepare(database& db, cql_stats& stats) {
     // Column family name
     const sstring& cf_name = _cf_name->get_column_family();
     std::regex name_regex("\\w+");
@@ -188,9 +182,9 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
         throw exceptions::invalid_request_exception(sprint("Multiple definition of identifier %s", (*i)->text()));
     }
-    properties->validate();
+    _properties.validate();
-    auto stmt = ::make_shared<create_table_statement>(_cf_name, properties, _if_not_exists, _static_columns);
+    auto stmt = ::make_shared<create_table_statement>(_cf_name, _properties.properties(), _if_not_exists, _static_columns);
     std::experimental::optional<std::map<bytes, data_type>> defined_multi_cell_collections;
     for (auto&& entry : _definitions) {
@@ -214,7 +208,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
         throw exceptions::invalid_request_exception("Multiple PRIMARY KEYs specifed (exactly one required)");
     }
-    stmt->_use_compact_storage = _use_compact_storage;
+    stmt->_use_compact_storage = _properties.use_compact_storage();
     auto& key_aliases = _key_aliases[0];
     std::vector<data_type> key_types;
@@ -233,7 +227,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
     // Handle column aliases
     if (_column_aliases.empty()) {
-        if (_use_compact_storage) {
+        if (_properties.use_compact_storage()) {
             // There should remain some column definition since it is a non-composite "static" CF
             if (stmt->_columns.empty()) {
                 throw exceptions::invalid_request_exception("No definition found that is not part of the PRIMARY KEY");
@@ -246,7 +240,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
     } else {
         // If we use compact storage and have only one alias, it is a
         // standard "dynamic" CF, otherwise it's a composite
-        if (_use_compact_storage && _column_aliases.size() == 1) {
+        if (_properties.use_compact_storage() && _column_aliases.size() == 1) {
             if (defined_multi_cell_collections) {
                 throw exceptions::invalid_request_exception("Collection types are not supported with COMPACT STORAGE");
             }
@@ -274,7 +268,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
         types.emplace_back(type);
     }
-    if (_use_compact_storage) {
+    if (_properties.use_compact_storage()) {
         if (defined_multi_cell_collections) {
             throw exceptions::invalid_request_exception("Collection types are not supported with COMPACT STORAGE");
         }
@@ -287,7 +281,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
     if (!_static_columns.empty()) {
         // Only CQL3 tables can have static columns
-        if (_use_compact_storage) {
+        if (_properties.use_compact_storage()) {
             throw exceptions::invalid_request_exception("Static columns are not supported in COMPACT STORAGE tables");
         }
         // Static columns only make sense if we have at least one clustering column. Otherwise everything is static anyway
@@ -296,7 +290,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
         }
     }
-    if (_use_compact_storage && !stmt->_column_aliases.empty()) {
+    if (_properties.use_compact_storage() && !stmt->_column_aliases.empty()) {
         if (stmt->_columns.empty()) {
 #if 0
             // The only value we'll insert will be the empty one, so the default validator don't matter
@@ -322,7 +316,7 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
     } else {
         // For compact, we are in the "static" case, so we need at least one column defined. For non-compact however, having
         // just the PK is fine since we have CQL3 row marker.
-        if (_use_compact_storage && stmt->_columns.empty()) {
+        if (_properties.use_compact_storage() && stmt->_columns.empty()) {
             throw exceptions::invalid_request_exception("COMPACT STORAGE with non-composite PRIMARY KEY require one column not part of the PRIMARY KEY, none given");
         }
 #if 0
@@ -335,18 +329,18 @@ create_table_statement::raw_statement::raw_statement(::shared_ptr<cf_name> name,
     }
     // If we give a clustering order, we must explicitly do so for all aliases and in the order of the PK
-    if (!_defined_ordering.empty()) {
-        if (_defined_ordering.size() > _column_aliases.size()) {
+    if (!_properties.defined_ordering().empty()) {
+        if (_properties.defined_ordering().size() > _column_aliases.size()) {
             throw exceptions::invalid_request_exception("Only clustering key columns can be defined in CLUSTERING ORDER directive");
         }
         int i = 0;
-        for (auto& pair: _defined_ordering){
+        for (auto& pair: _properties.defined_ordering()){
             auto& id = pair.first;
             auto& c = _column_aliases.at(i);
             if (!(*id == *c)) {
-                if (find_ordering_info(c)) {
+                if (_properties.find_ordering_info(c)) {
                     throw exceptions::invalid_request_exception(sprint("The order of columns in the CLUSTERING ORDER directive must be the one of the clustering key (%s must appear before %s)", c, id));
                 } else {
                     throw exceptions::invalid_request_exception(sprint("Missing CLUSTERING ORDER for column %s", c));
@@ -371,12 +365,7 @@ data_type create_table_statement::raw_statement::get_type_and_remove(column_map_
     }
     columns.erase(t);
-    auto is_reversed = find_ordering_info(t);
-    if (!is_reversed) {
-        return type;
-    } else {
-        return *is_reversed ? reversed_type_impl::get_instance(type) : type;
-    }
+    return _properties.get_reversable_type(t, type);
 }
 void create_table_statement::raw_statement::add_definition(::shared_ptr<column_identifier> def, ::shared_ptr<cql3_type::raw> type, bool is_static) {
@@ -395,14 +384,6 @@ void create_table_statement::raw_statement::add_column_alias(::shared_ptr<column
     _column_aliases.emplace_back(alias);
 }
-void create_table_statement::raw_statement::set_ordering(::shared_ptr<column_identifier> alias, bool reversed) {
-    _defined_ordering.emplace_back(alias, reversed);
-}
-void create_table_statement::raw_statement::set_compact_storage() {
-    _use_compact_storage = true;
-}
 }
 }


@@ -43,6 +43,7 @@
 #include "cql3/statements/schema_altering_statement.hh"
 #include "cql3/statements/cf_prop_defs.hh"
+#include "cql3/statements/cf_properties.hh"
 #include "cql3/statements/raw/cf_statement.hh"
 #include "cql3/cql3_type.hh"
@@ -103,7 +104,7 @@ public:
     virtual shared_ptr<transport::event::schema_change> change_event() override;
-    virtual shared_ptr<prepared> prepare(database& db) override;
+    virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
     schema_ptr get_cf_meta_data();
@@ -125,30 +126,22 @@ private:
             shared_ptr_value_hash<column_identifier>,
             shared_ptr_equal_by_value<column_identifier>>;
     defs_type _definitions;
-public:
-    const ::shared_ptr<cf_prop_defs> properties = ::make_shared<cf_prop_defs>();
-private:
     std::vector<std::vector<::shared_ptr<column_identifier>>> _key_aliases;
     std::vector<::shared_ptr<column_identifier>> _column_aliases;
-    std::vector<std::pair<::shared_ptr<column_identifier>, bool>> _defined_ordering; // Insertion ordering is important
-    std::experimental::optional<bool> find_ordering_info(::shared_ptr<column_identifier> type) {
-        for (auto& t: _defined_ordering) {
-            if (*(t.first) == *type) {
-                return t.second;
-            }
-        }
-        return {};
-    }
     create_table_statement::column_set_type _static_columns;
-    bool _use_compact_storage = false;
     std::multiset<::shared_ptr<column_identifier>,
             indirect_less<::shared_ptr<column_identifier>, column_identifier::text_comparator>> _defined_names;
     bool _if_not_exists;
+    cf_properties _properties;
 public:
     raw_statement(::shared_ptr<cf_name> name, bool if_not_exists);
-    virtual ::shared_ptr<prepared> prepare(database& db) override;
+    virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
+    cf_properties& properties() {
+        return _properties;
+    }
     data_type get_type_and_remove(column_map_type& columns, ::shared_ptr<column_identifier> t);
@@ -157,10 +150,6 @@ public:
     void add_key_aliases(const std::vector<::shared_ptr<column_identifier>> aliases);
     void add_column_alias(::shared_ptr<column_identifier> alias);
-    void set_ordering(::shared_ptr<column_identifier> alias, bool reversed);
-    void set_compact_storage();
 };
 }


@@ -157,7 +157,7 @@ future<bool> create_type_statement::announce_migration(distributed<service::stor
 }
 shared_ptr<cql3::statements::prepared_statement>
-create_type_statement::prepare(database& db) {
+create_type_statement::prepare(database& db, cql_stats& stats) {
     return make_shared<prepared_statement>(make_shared<create_type_statement>(*this));
 }


@@ -69,7 +69,7 @@ public:
     virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
-    virtual shared_ptr<prepared> prepare(database& db) override;
+    virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
     static void check_for_duplicate_names(user_type type);
 private:


@@ -0,0 +1,344 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <inttypes.h>
#include <regex>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/algorithm/adjacent_find.hpp>
#include "cql3/statements/create_view_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "schema_builder.hh"
#include "service/storage_proxy.hh"
namespace cql3 {
namespace statements {
create_view_statement::create_view_statement(
::shared_ptr<cf_name> view_name,
::shared_ptr<cf_name> base_name,
std::vector<::shared_ptr<selection::raw_selector>> select_clause,
std::vector<::shared_ptr<relation>> where_clause,
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys,
std::vector<::shared_ptr<cql3::column_identifier::raw>> clustering_keys,
bool if_not_exists)
: schema_altering_statement{view_name}
, _base_name{base_name}
, _select_clause{select_clause}
, _where_clause{where_clause}
, _partition_keys{partition_keys}
, _clustering_keys{clustering_keys}
, _if_not_exists{if_not_exists}
{
// TODO: probably need to create a "statement_restrictions" like select does
// based on the select_clause, base_name and where_clause; However need to
// pass for_view=true.
fail(unimplemented::cause::VIEWS);
}
// FIXME: I copied the following from create_table_statement. I don't know
// what they do or whether they need to change for create view.
future<> create_view_statement::check_access(const service::client_state& state) {
return state.has_keyspace_access(keyspace(), auth::permission::CREATE);
}
void create_view_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state) {
// validated in announceMigration()
}
future<bool> create_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) {
// FIXME: this code from create_table_view is probably wrong, the Java CreateViewStatement.announceMigration is much more elaborate
#if 0
****** Our implementation in create_table_statement (simpler code, but it was simpler in Java too)
return make_ready_future<>().then([this, is_local_only] {
return service::get_local_migration_manager().announce_new_column_family(get_cf_meta_data(), is_local_only);
}).then_wrapped([this] (auto&& f) {
try {
f.get();
return true;
} catch (const exceptions::already_exists_exception& e) {
if (_if_not_exists) {
return false;
}
throw e;
}
});
#endif
#if 0
***** This if 0 code is from Cassandra CreateViewStatement
// We need to make sure that:
// - primary key includes all columns in base table's primary key
// - make sure that the select statement does not have anything other than columns
// and their names match the base table's names
// - make sure that primary key does not include any collections
// - make sure there is no where clause in the select statement
// - make sure there is not currently a table or view
// - make sure baseTable gcGraceSeconds > 0
properties.validate();
if (properties.useCompactStorage)
throw new InvalidRequestException("Cannot use 'COMPACT STORAGE' when defining a materialized view");
#endif
// View and base tables must be in the same keyspace, to ensure that RF
// is the same (because we assign a view replica to each base replica).
// If a keyspace was not specified for the base table name, it is assumed
// it is in the same keyspace as the view table being created (which
// itself might be the current USEd keyspace, or explicitly specified).
if (_base_name->get_keyspace().empty()) {
_base_name->set_keyspace(keyspace(), true);
}
if (_base_name->get_keyspace() != keyspace()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot create a materialized view on a table in a separate keyspace ('%s' != '%s')",
_base_name->get_keyspace(), keyspace()));
}
// Validate that the keyspace and the base table exist, and is not a
// special table on which we cannot create views:
// CONTINUE HERE: something like the code below taken from migration_manager instead of the validateColumFamily below
auto& db = service::get_local_storage_proxy().get_db().local();
if (!db.has_keyspace(_base_name->get_keyspace())) {
throw exceptions::invalid_request_exception(sprint(
"Keyspace '%s' does not exist", _base_name->get_keyspace()));
}
// auto& ks = db.find_keyspace(_base_name->get_keyspace());
if (!db.has_schema(_base_name->get_keyspace(), _base_name->get_column_family())) {
throw exceptions::invalid_request_exception(sprint(
"Base table '%s' does not exist",
_base_name->get_column_family()));
}
return make_ready_future<bool>(true);
// if (db.has_schema(_base_name->get_keyspace(), _base_name->get_column_family())) {
// throw exceptions::already_exists_exception(cfm->ks_name(), cfm->cf_name());
// }
#if 0
CFMetaData cfm = ThriftValidation.validateColumnFamily(baseName.getKeyspace(), baseName.getColumnFamily());
if (cfm.isCounter())
throw new InvalidRequestException("Materialized views are not supported on counter tables");
if (cfm.isView())
throw new InvalidRequestException("Materialized views cannot be created against other materialized views");
if (cfm.params.gcGraceSeconds == 0)
{
throw new InvalidRequestException(String.format("Cannot create materialized view '%s' for base table " +
"'%s' with gc_grace_seconds of 0, since this value is " +
"used to TTL undelivered updates. Setting gc_grace_seconds" +
" too low might cause undelivered updates to expire " +
"before being replayed.", cfName.getColumnFamily(),
baseName.getColumnFamily()));
}
Set<ColumnIdentifier> included = new HashSet<>();
for (RawSelector selector : selectClause)
{
Selectable.Raw selectable = selector.selectable;
if (selectable instanceof Selectable.WithFieldSelection.Raw)
throw new InvalidRequestException("Cannot select out a part of type when defining a materialized view");
if (selectable instanceof Selectable.WithFunction.Raw)
throw new InvalidRequestException("Cannot use function when defining a materialized view");
if (selectable instanceof Selectable.WritetimeOrTTL.Raw)
throw new InvalidRequestException("Cannot use function when defining a materialized view");
ColumnIdentifier identifier = (ColumnIdentifier) selectable.prepare(cfm);
if (selector.alias != null)
throw new InvalidRequestException(String.format("Cannot alias column '%s' as '%s' when defining a materialized view", identifier.toString(), selector.alias.toString()));
ColumnDefinition cdef = cfm.getColumnDefinition(identifier);
if (cdef == null)
throw new InvalidRequestException("Unknown column name detected in CREATE MATERIALIZED VIEW statement : "+identifier);
included.add(identifier);
}
Set<ColumnIdentifier.Raw> targetPrimaryKeys = new HashSet<>();
for (ColumnIdentifier.Raw identifier : Iterables.concat(partitionKeys, clusteringKeys))
{
if (!targetPrimaryKeys.add(identifier))
throw new InvalidRequestException("Duplicate entry found in PRIMARY KEY: "+identifier);
ColumnDefinition cdef = cfm.getColumnDefinition(identifier.prepare(cfm));
if (cdef == null)
throw new InvalidRequestException("Unknown column name detected in CREATE MATERIALIZED VIEW statement : "+identifier);
if (cfm.getColumnDefinition(identifier.prepare(cfm)).type.isMultiCell())
throw new InvalidRequestException(String.format("Cannot use MultiCell column '%s' in PRIMARY KEY of materialized view", identifier));
if (cdef.isStatic())
throw new InvalidRequestException(String.format("Cannot use Static column '%s' in PRIMARY KEY of materialized view", identifier));
}
// build the select statement
Map<ColumnIdentifier.Raw, Boolean> orderings = Collections.emptyMap();
SelectStatement.Parameters parameters = new SelectStatement.Parameters(orderings, false, true, false);
SelectStatement.RawStatement rawSelect = new SelectStatement.RawStatement(baseName, parameters, selectClause, whereClause, null, null);
ClientState state = ClientState.forInternalCalls();
state.setKeyspace(keyspace());
rawSelect.prepareKeyspace(state);
rawSelect.setBoundVariables(getBoundVariables());
ParsedStatement.Prepared prepared = rawSelect.prepare(true);
SelectStatement select = (SelectStatement) prepared.statement;
StatementRestrictions restrictions = select.getRestrictions();
if (!prepared.boundNames.isEmpty())
throw new InvalidRequestException("Cannot use query parameters in CREATE MATERIALIZED VIEW statements");
if (!restrictions.nonPKRestrictedColumns(false).isEmpty())
{
throw new InvalidRequestException(String.format(
"Non-primary key columns cannot be restricted in the SELECT statement used for materialized view " +
"creation (got restrictions on: %s)",
restrictions.nonPKRestrictedColumns(false).stream().map(def -> def.name.toString()).collect(Collectors.joining(", "))));
}
String whereClauseText = View.relationsToWhereClause(whereClause.relations);
Set<ColumnIdentifier> basePrimaryKeyCols = new HashSet<>();
for (ColumnDefinition definition : Iterables.concat(cfm.partitionKeyColumns(), cfm.clusteringColumns()))
basePrimaryKeyCols.add(definition.name);
List<ColumnIdentifier> targetClusteringColumns = new ArrayList<>();
List<ColumnIdentifier> targetPartitionKeys = new ArrayList<>();
// This is only used as an intermediate state; this is to catch whether multiple non-PK columns are used
boolean hasNonPKColumn = false;
for (ColumnIdentifier.Raw raw : partitionKeys)
hasNonPKColumn |= getColumnIdentifier(cfm, basePrimaryKeyCols, hasNonPKColumn, raw, targetPartitionKeys, restrictions);
for (ColumnIdentifier.Raw raw : clusteringKeys)
hasNonPKColumn |= getColumnIdentifier(cfm, basePrimaryKeyCols, hasNonPKColumn, raw, targetClusteringColumns, restrictions);
// We need to include all of the primary key columns from the base table in order to make sure that we do not
// overwrite values in the view. We cannot support "collapsing" the base table into a smaller number of rows in
// the view because if we need to generate a tombstone, we have no way of knowing which value is currently being
// used in the view and whether or not to generate a tombstone. In order to not surprise our users, we require
// that they include all of the columns. We provide them with a list of all of the columns left to include.
boolean missingClusteringColumns = false;
StringBuilder columnNames = new StringBuilder();
List<ColumnIdentifier> includedColumns = new ArrayList<>();
for (ColumnDefinition def : cfm.allColumns())
{
ColumnIdentifier identifier = def.name;
boolean includeDef = included.isEmpty() || included.contains(identifier);
if (includeDef && def.isStatic())
{
throw new InvalidRequestException(String.format("Unable to include static column '%s' which would be included by Materialized View SELECT * statement", identifier));
}
if (includeDef && !targetClusteringColumns.contains(identifier) && !targetPartitionKeys.contains(identifier))
{
includedColumns.add(identifier);
}
if (!def.isPrimaryKeyColumn()) continue;
if (!targetClusteringColumns.contains(identifier) && !targetPartitionKeys.contains(identifier))
{
if (missingClusteringColumns)
columnNames.append(',');
else
missingClusteringColumns = true;
columnNames.append(identifier);
}
}
if (missingClusteringColumns)
throw new InvalidRequestException(String.format("Cannot create Materialized View %s without primary key columns from base %s (%s)",
columnFamily(), baseName.getColumnFamily(), columnNames.toString()));
if (targetPartitionKeys.isEmpty())
throw new InvalidRequestException("Must select at least a column for a Materialized View");
if (targetClusteringColumns.isEmpty())
throw new InvalidRequestException("No columns are defined for Materialized View other than primary key");
CFMetaData.Builder cfmBuilder = CFMetaData.Builder.createView(keyspace(), columnFamily());
add(cfm, targetPartitionKeys, cfmBuilder::addPartitionKey);
add(cfm, targetClusteringColumns, cfmBuilder::addClusteringColumn);
add(cfm, includedColumns, cfmBuilder::addRegularColumn);
cfmBuilder.withId(properties.properties.getId());
TableParams params = properties.properties.asNewTableParams();
CFMetaData viewCfm = cfmBuilder.build().params(params);
ViewDefinition definition = new ViewDefinition(keyspace(),
columnFamily(),
Schema.instance.getId(keyspace(), baseName.getColumnFamily()),
baseName.getColumnFamily(),
included.isEmpty(),
rawSelect,
whereClauseText,
viewCfm);
try
{
MigrationManager.announceNewView(definition, isLocalOnly);
return new Event.SchemaChange(Event.SchemaChange.Change.CREATED, Event.SchemaChange.Target.TABLE, keyspace(), columnFamily());
}
catch (AlreadyExistsException e)
{
if (ifNotExists)
return null;
throw e;
}
#endif
}
shared_ptr<transport::event::schema_change> create_view_statement::change_event() {
// FIXME: this is probably wrong, I just copied it from create_table_statement
return make_shared<transport::event::schema_change>(transport::event::schema_change::change_type::CREATED, transport::event::schema_change::target_type::TABLE, keyspace(), column_family());
}
shared_ptr<cql3::statements::prepared_statement>
create_view_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<create_view_statement>(*this));
}
}
}
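The implemented part of `announce_migration()` above enforces that a view lives in the same keyspace as its base table, with an unqualified base name defaulting to the view's keyspace. A standalone sketch of just that rule, under hypothetical simplified names (plain strings and a standard exception instead of `cf_name` and `invalid_request_exception`):

```cpp
#include <stdexcept>
#include <string>

// Sketch of the keyspace rule: view replicas are paired with base replicas,
// so base and view must share a keyspace (and therefore a replication factor).
std::string resolve_base_keyspace(std::string base_ks, const std::string& view_ks) {
    if (base_ks.empty()) {
        base_ks = view_ks; // same default as _base_name->set_keyspace(keyspace(), true)
    }
    if (base_ks != view_ks) {
        throw std::invalid_argument(
            "Cannot create a materialized view on a table in a separate keyspace");
    }
    return base_ks;
}
```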


@@ -0,0 +1,79 @@
/*
* This file is part of Scylla.
* Copyright (C) 2016 ScyllaDB
*
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/statements/schema_altering_statement.hh"
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/statements/cf_properties.hh"
#include "cql3/cql3_type.hh"
#include "cql3/selection/raw_selector.hh"
#include "cql3/relation.hh"
#include "cql3/cf_name.hh"
#include "service/migration_manager.hh"
#include "schema.hh"
#include "core/shared_ptr.hh"
#include <utility>
#include <vector>
#include <experimental/optional>
namespace cql3 {
namespace statements {
/** A <code>CREATE MATERIALIZED VIEW</code> parsed from a CQL query statement. */
class create_view_statement : public schema_altering_statement {
private:
::shared_ptr<cf_name> _base_name;
std::vector<::shared_ptr<selection::raw_selector>> _select_clause;
std::vector<::shared_ptr<relation>> _where_clause;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> _clustering_keys;
cf_properties _properties;
bool _if_not_exists;
public:
create_view_statement(
::shared_ptr<cf_name> view_name,
::shared_ptr<cf_name> base_name,
std::vector<::shared_ptr<selection::raw_selector>> select_clause,
std::vector<::shared_ptr<relation>> where_clause,
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys,
std::vector<::shared_ptr<cql3::column_identifier::raw>> clustering_keys,
bool if_not_exists);
auto& properties() {
return _properties;
}
// Functions we need to override to subclass schema_altering_statement
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
// FIXME: continue here. See create_table_statement.hh and CreateViewStatement.java
};
}
}

View File

@@ -46,8 +46,8 @@ namespace cql3 {
namespace statements {
delete_statement::delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs)}
delete_statement::delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, cql_stats& stats)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs), &stats.deletes}
{ }
bool delete_statement::require_full_clustering_key() const {
@@ -80,10 +80,10 @@ void delete_statement::add_update_for_key(mutation& m, const exploded_clustering
namespace raw {
::shared_ptr<cql3::statements::modification_statement>
delete_statement::prepare_internal(database& db, schema_ptr schema, ::shared_ptr<variable_specifications> bound_names,
std::unique_ptr<attributes> attrs) {
delete_statement::prepare_internal(database& db, schema_ptr schema, shared_ptr<variable_specifications> bound_names,
std::unique_ptr<attributes> attrs, cql_stats& stats) {
using statement_type = cql3::statements::modification_statement::statement_type;
auto stmt = ::make_shared<cql3::statements::delete_statement>(statement_type::DELETE, bound_names->size(), schema, std::move(attrs));
auto stmt = ::make_shared<cql3::statements::delete_statement>(statement_type::DELETE, bound_names->size(), schema, std::move(attrs), stats);
for (auto&& deletion : _deletions) {
auto&& id = deletion->affected_column()->prepare_column_identifier(schema);

View File

@@ -56,7 +56,7 @@ namespace statements {
*/
class delete_statement : public modification_statement {
public:
delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs);
delete_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, cql_stats& stats);
virtual bool require_full_clustering_key() const override;

View File

@@ -99,7 +99,7 @@ shared_ptr<transport::event::schema_change> drop_keyspace_statement::change_even
}
shared_ptr<cql3::statements::prepared_statement>
drop_keyspace_statement::prepare(database& db) {
drop_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_keyspace_statement>(*this));
}

View File

@@ -63,7 +63,7 @@ public:
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -100,7 +100,7 @@ shared_ptr<transport::event::schema_change> drop_table_statement::change_event()
}
shared_ptr<cql3::statements::prepared_statement>
drop_table_statement::prepare(database& db) {
drop_table_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_table_statement>(*this));
}

View File

@@ -62,7 +62,7 @@ public:
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -118,7 +118,7 @@ future<bool> drop_type_statement::announce_migration(distributed<service::storag
}
shared_ptr<cql3::statements::prepared_statement>
drop_type_statement::prepare(database& db) {
drop_type_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<drop_type_statement>(*this));
}

View File

@@ -65,7 +65,7 @@ public:
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -73,12 +73,13 @@ operator<<(std::ostream& out, modification_statement::statement_type t) {
return out;
}
modification_statement::modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_)
modification_statement::modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_, uint64_t* cql_stats_counter_ptr)
: type{type_}
, _bound_terms{bound_terms}
, s{schema_}
, attrs{std::move(attrs_)}
, _column_operations{}
, _cql_modification_counter_ptr(cql_stats_counter_ptr)
{ }
bool modification_statement::uses_function(const sstring& ks_name, const sstring& function_name) const {
@@ -453,6 +454,8 @@ modification_statement::execute(distributed<service::storage_proxy>& proxy, serv
return execute_with_condition(proxy, qs, options);
}
inc_cql_stats();
return execute_without_condition(proxy, qs, options).then([] {
return make_ready_future<::shared_ptr<transport::messages::result_message>>(
::shared_ptr<transport::messages::result_message>{});
@@ -513,6 +516,8 @@ modification_statement::execute_internal(distributed<service::storage_proxy>& pr
tracing::add_table_name(qs.get_trace_state(), keyspace(), column_family());
inc_cql_stats();
return get_mutations(proxy, options, true, options.get_timestamp(qs), qs.get_trace_state()).then(
[&proxy] (auto mutations) {
return proxy.local().mutate_locally(std::move(mutations));
@@ -573,20 +578,20 @@ modification_statement::process_where_clause(database& db, std::vector<relation_
namespace raw {
::shared_ptr<prepared_statement>
modification_statement::modification_statement::prepare(database& db) {
modification_statement::modification_statement::prepare(database& db, cql_stats& stats) {
auto bound_names = get_bound_variables();
auto statement = prepare(db, bound_names);
auto statement = prepare(db, bound_names, stats);
return ::make_shared<prepared>(std::move(statement), *bound_names);
}
::shared_ptr<cql3::statements::modification_statement>
modification_statement::prepare(database& db, ::shared_ptr<variable_specifications> bound_names) {
modification_statement::prepare(database& db, ::shared_ptr<variable_specifications> bound_names, cql_stats& stats) {
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
auto prepared_attributes = _attrs->prepare(db, keyspace(), column_family());
prepared_attributes->collect_marker_specification(bound_names);
::shared_ptr<cql3::statements::modification_statement> stmt = prepare_internal(db, schema, bound_names, std::move(prepared_attributes));
::shared_ptr<cql3::statements::modification_statement> stmt = prepare_internal(db, schema, bound_names, std::move(prepared_attributes), stats);
if (_if_not_exists || _if_exists || !_conditions.empty()) {
if (stmt->is_counter()) {

View File

@@ -109,8 +109,10 @@ private:
return cond->column;
};
uint64_t* _cql_modification_counter_ptr = nullptr;
public:
modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_);
modification_statement(statement_type type_, uint32_t bound_terms, schema_ptr schema_, std::unique_ptr<attributes> attrs_, uint64_t* cql_stats_counter_ptr);
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
@@ -152,6 +154,11 @@ public:
staticConditions == null ? Collections.<ColumnDefinition>emptyList() : Iterables.transform(staticConditions, getColumnForCondition));
}
#endif
void inc_cql_stats() {
++(*_cql_modification_counter_ptr);
}
public:
void add_condition(::shared_ptr<column_condition> cond);

View File

@@ -84,7 +84,7 @@ public:
}
}
virtual shared_ptr<prepared> prepare(database& db) override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -66,7 +66,7 @@ public:
bool if_exists);
protected:
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs);
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats);
};
}

View File

@@ -77,7 +77,7 @@ public:
bool if_not_exists);
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs) override;
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats) override;
};

View File

@@ -83,11 +83,11 @@ protected:
modification_statement(::shared_ptr<cf_name> name, ::shared_ptr<attributes::raw> attrs, conditions_vector conditions, bool if_not_exists, bool if_exists);
public:
virtual ::shared_ptr<prepared> prepare(database& db) override;
::shared_ptr<cql3::statements::modification_statement> prepare(database& db, ::shared_ptr<variable_specifications> bound_names);;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
::shared_ptr<cql3::statements::modification_statement> prepare(database& db, ::shared_ptr<variable_specifications> bound_names, cql_stats& stats);
protected:
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs) = 0;
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats) = 0;
};
}

View File

@@ -44,6 +44,7 @@
#include "cql3/variable_specifications.hh"
#include "cql3/column_specification.hh"
#include "cql3/column_identifier.hh"
#include "cql3/stats.hh"
#include <seastar/core/shared_ptr.hh>
@@ -70,7 +71,7 @@ public:
void set_bound_variables(const std::vector<::shared_ptr<column_identifier>>& bound_names);
virtual ::shared_ptr<prepared> prepare(database& db) = 0;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const;
};

View File

@@ -100,7 +100,7 @@ public:
std::vector<::shared_ptr<relation>> where_clause,
::shared_ptr<term::raw> limit);
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual ::shared_ptr<prepared> prepare(database& db,cql_stats& stats) override;
private:
::shared_ptr<restrictions::statement_restrictions> prepare_restrictions(
database& db,

View File

@@ -81,7 +81,7 @@ public:
conditions_vector conditions);
protected:
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs);
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats);
};
}

View File

@@ -58,7 +58,7 @@ private:
public:
use_statement(sstring keyspace);
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual ::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}

View File

@@ -87,7 +87,8 @@ select_statement::select_statement(schema_ptr schema,
::shared_ptr<restrictions::statement_restrictions> restrictions,
bool is_reversed,
ordering_comparator_type ordering_comparator,
::shared_ptr<term> limit)
::shared_ptr<term> limit,
cql_stats& stats)
: _schema(schema)
, _bound_terms(bound_terms)
, _parameters(std::move(parameters))
@@ -96,6 +97,7 @@ select_statement::select_statement(schema_ptr schema,
, _is_reversed(is_reversed)
, _limit(std::move(limit))
, _ordering_comparator(std::move(ordering_comparator))
, _stats(stats)
{
_opts = _selection->get_query_options();
}
@@ -107,7 +109,7 @@ bool select_statement::uses_function(const sstring& ks_name, const sstring& func
}
::shared_ptr<select_statement>
select_statement::for_selection(schema_ptr schema, ::shared_ptr<selection::selection> selection) {
select_statement::for_selection(schema_ptr schema, ::shared_ptr<selection::selection> selection, cql_stats& stats) {
return ::make_shared<select_statement>(schema,
0,
_default_parameters,
@@ -115,7 +117,8 @@ select_statement::for_selection(schema_ptr schema, ::shared_ptr<selection::selec
::make_shared<restrictions::statement_restrictions>(schema),
false,
ordering_comparator_type{},
::shared_ptr<term>{});
::shared_ptr<term>{},
stats);
}
::shared_ptr<const cql3::metadata> select_statement::get_result_metadata() const {
@@ -227,6 +230,8 @@ select_statement::execute(distributed<service::storage_proxy>& proxy,
int32_t limit = get_limit(options);
auto now = db_clock::now();
++_stats.reads;
auto command = ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(),
make_partition_slice(options), limit, to_gc_clock(now), tracing::make_trace_info(state.get_trace_state()), query::max_partitions, options.get_timestamp(state));
@@ -249,6 +254,7 @@ select_statement::execute(distributed<service::storage_proxy>& proxy,
now);
}
command->slice.options.set<query::partition_slice::option::allow_short_read>();
auto p = service::pager::query_pagers::pager(_schema, _selection,
state, options, command, std::move(key_ranges));
@@ -301,7 +307,8 @@ select_statement::execute(distributed<service::storage_proxy>& proxy,
// doing post-query ordering.
if (needs_post_query_ordering() && _limit) {
return do_with(std::forward<std::vector<query::partition_range>>(partition_ranges), [this, &proxy, &state, &options, cmd](auto prs) {
query::result_merger merger;
assert(cmd->partition_limit == query::max_partitions);
query::result_merger merger(cmd->row_limit * prs.size(), query::max_partitions);
return map_reduce(prs.begin(), prs.end(), [this, &proxy, &state, &options, cmd] (auto pr) {
std::vector<query::partition_range> prange { pr };
auto command = ::make_lw_shared<query::read_command>(*cmd);
@@ -331,9 +338,12 @@ select_statement::execute_internal(distributed<service::storage_proxy>& proxy,
tracing::add_table_name(state.get_trace_state(), keyspace(), column_family());
++_stats.reads;
if (needs_post_query_ordering() && _limit) {
return do_with(std::move(partition_ranges), [this, &proxy, &state, command] (auto prs) {
query::result_merger merger;
assert(command->partition_limit == query::max_partitions);
query::result_merger merger(command->row_limit * prs.size(), query::max_partitions);
return map_reduce(prs.begin(), prs.end(), [this, &proxy, &state, command] (auto pr) {
std::vector<query::partition_range> prange { pr };
auto cmd = ::make_lw_shared<query::read_command>(*command);
@@ -367,8 +377,8 @@ select_statement::process_results(foreign_ptr<lw_shared_ptr<query::result>> resu
if (_is_reversed) {
rs->reverse();
}
rs->trim(cmd->row_limit);
}
rs->trim(cmd->row_limit);
return ::make_shared<transport::messages::result_message::rows>(std::move(rs));
}
@@ -386,7 +396,7 @@ select_statement::select_statement(::shared_ptr<cf_name> cf_name,
, _limit(std::move(limit))
{ }
::shared_ptr<prepared_statement> select_statement::prepare(database& db) {
::shared_ptr<prepared_statement> select_statement::prepare(database& db, cql_stats& stats) {
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
auto bound_names = get_bound_variables();
@@ -418,7 +428,8 @@ select_statement::select_statement(::shared_ptr<cf_name> cf_name,
std::move(restrictions),
is_reversed_,
std::move(ordering_comparator),
prepare_limit(db, bound_names));
prepare_limit(db, bound_names),
stats);
return ::make_shared<prepared>(std::move(stmt), std::move(*bound_names));
}

View File

@@ -89,6 +89,7 @@ private:
ordering_comparator_type _ordering_comparator;
query::partition_slice::option_set _opts;
cql_stats& _stats;
public:
select_statement(schema_ptr schema,
uint32_t bound_terms,
@@ -97,7 +98,8 @@ public:
::shared_ptr<restrictions::statement_restrictions> restrictions,
bool is_reversed,
ordering_comparator_type ordering_comparator,
::shared_ptr<term> limit);
::shared_ptr<term> limit,
cql_stats& stats);
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
@@ -105,7 +107,7 @@ public:
// Note that the results select statement should not be used for actual queries, but only for processing already
// queried data through processColumnFamily.
static ::shared_ptr<select_statement> for_selection(
schema_ptr schema, ::shared_ptr<selection::selection> selection);
schema_ptr schema, ::shared_ptr<selection::selection> selection, cql_stats& stats);
virtual ::shared_ptr<const cql3::metadata> get_result_metadata() const override;
virtual uint32_t get_bound_terms() override;

View File

@@ -59,7 +59,7 @@ uint32_t truncate_statement::get_bound_terms()
return 0;
}
::shared_ptr<prepared_statement> truncate_statement::prepare(database& db)
::shared_ptr<prepared_statement> truncate_statement::prepare(database& db,cql_stats& stats)
{
return ::make_shared<prepared>(this->shared_from_this());
}

View File

@@ -56,7 +56,7 @@ public:
virtual uint32_t get_bound_terms() override;
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual ::shared_ptr<prepared> prepare(database& db,cql_stats& stats) override;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;

View File

@@ -50,8 +50,8 @@ namespace cql3 {
namespace statements {
update_statement::update_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs)}
update_statement::update_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, uint64_t* cql_stats_counter_ptr)
: modification_statement{type, bound_terms, std::move(s), std::move(attrs), cql_stats_counter_ptr}
{ }
bool update_statement::require_full_clustering_key() const {
@@ -124,10 +124,10 @@ insert_statement::insert_statement( ::shared_ptr<cf_name> name,
::shared_ptr<cql3::statements::modification_statement>
insert_statement::prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs)
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats)
{
using statement_type = cql3::statements::modification_statement::statement_type;
auto stmt = ::make_shared<cql3::statements::update_statement>(statement_type::INSERT, bound_names->size(), schema, std::move(attrs));
auto stmt = ::make_shared<cql3::statements::update_statement>(statement_type::INSERT, bound_names->size(), schema, std::move(attrs), &stats.inserts);
// Created from an INSERT
if (stmt->is_counter()) {
@@ -181,10 +181,10 @@ update_statement::update_statement( ::shared_ptr<cf_name> name,
::shared_ptr<cql3::statements::modification_statement>
update_statement::prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs)
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats)
{
using statement_type = cql3::statements::modification_statement::statement_type;
auto stmt = ::make_shared<cql3::statements::update_statement>(statement_type::UPDATE, bound_names->size(), schema, std::move(attrs));
auto stmt = ::make_shared<cql3::statements::update_statement>(statement_type::UPDATE, bound_names->size(), schema, std::move(attrs), &stats.updates);
for (auto&& entry : _updates) {
auto id = entry.first->prepare_column_identifier(schema);

View File

@@ -65,7 +65,7 @@ public:
private static final Constants.Value EMPTY = new Constants.Value(ByteBufferUtil.EMPTY_BYTE_BUFFER);
#endif
update_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs);
update_statement(statement_type type, uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, uint64_t* cql_stats_counter_ptr);
private:
virtual bool require_full_clustering_key() const override;

View File

@@ -65,7 +65,7 @@ use_statement::use_statement(sstring keyspace)
{
}
::shared_ptr<prepared_statement> use_statement::prepare(database& db)
::shared_ptr<prepared_statement> use_statement::prepare(database& db, cql_stats& stats)
{
return ::make_shared<prepared>(make_shared<cql3::statements::use_statement>(_keyspace));
}

cql3/stats.hh (new file, 36 lines)

View File

@@ -0,0 +1,36 @@
/*
* Copyright (C) 2015 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
namespace cql3 {
struct cql_stats {
uint64_t reads = 0;
uint64_t inserts = 0;
uint64_t updates = 0;
uint64_t deletes = 0;
uint64_t batches = 0;
};
}

File diff suppressed because it is too large

View File

@@ -70,10 +70,10 @@
#include "utils/estimated_histogram.hh"
#include "sstables/compaction.hh"
#include "sstables/sstable_set.hh"
#include "key_reader.hh"
#include <seastar/core/rwlock.hh>
#include <seastar/core/shared_future.hh>
#include "tracing/trace_state.hh"
#include <boost/intrusive/parent_from_member.hpp>
class frozen_mutation;
class reconcilable_result;
@@ -118,44 +118,93 @@ class dirty_memory_manager: public logalloc::region_group_reclaimer {
// memtable is totally gone. That means that if we have throttled requests, they will stay
// throttled for a long time. Even when we have virtual dirty, that only provides a rough
// estimate, and we can't release requests that early.
//
// Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind
// would take care of the rest. But that still has issues, so we'll limit parallelism to some
// number (4), that we will hopefully reduce to 1 when write behind works.
//
// When streaming is going on, we'll separate half of that for the streaming code, which
// effectively increases the total to 6. That is a bit ugly and a bit redundant with the I/O
// Scheduler, but it's the easiest way not to hurt the common case (no streaming) and will have
// to do for the moment. Hopefully we can set both to 1 soon (with write behind)
//
// FIXME: enable write behind and set both to 1. Right now we will take advantage of the fact
// that memtables and streaming will use different specialized classes here and set them as
// default values here.
size_t _concurrency;
semaphore _flush_serializer;
// We will accept a new flush before another one ends, once it is done with the data write.
// That is so we can keep the disk always busy. But there is still some background work that is
// left to be done. Mostly, update the caches and seal the auxiliary components of the SSTable.
// This semaphore will cap the amount of background work that we have. Note that we're not
// overly concerned about memtable memory, because dirty memory will put a limit to that. This
// is mostly about dangling continuations. So that doesn't have to be a small number.
static constexpr unsigned _max_background_work = 20;
semaphore _background_work_flush_serializer = { _max_background_work };
condition_variable _should_flush;
int64_t _dirty_bytes_released_pre_accounted = 0;
seastar::gate _waiting_flush_gate;
std::vector<shared_memtable> _pending_flushes;
void maybe_do_active_flush();
protected:
virtual memtable_list& get_memtable_list(column_family& cf) = 0;
virtual void start_reclaiming() override;
future<> flush_when_needed();
struct flush_permit {
semaphore_units<> permit;
flush_permit(semaphore_units<>&& permit) : permit(std::move(permit)) {}
};
// We need to start a flush before the current one finishes, otherwise
// we'll have a period without significant disk activity when the current
// SSTable is being sealed, the caches are being updated, etc. To do that
// we need to keep track of which region we are flushing this memory from.
std::unordered_map<const logalloc::region*, flush_permit> _flush_manager;
future<> _waiting_flush;
virtual void start_reclaiming() noexcept override;
bool has_pressure() const {
return over_soft_limit();
}
std::vector<scollectd::registration> _collectd;
public:
void setup_collectd(sstring namestr);
future<> shutdown();
dirty_memory_manager(database* db, size_t threshold, size_t concurrency)
: logalloc::region_group_reclaimer(threshold)
, _db(db)
, _region_group(*this)
, _concurrency(concurrency)
, _flush_serializer(concurrency) {}
// Limits and pressure conditions:
// ===============================
//
// Virtual Dirty
// -------------
// We can't free memory until the whole memtable is flushed because we need to keep it in memory
// until the end, but we can fake freeing memory. When we are done with an element of the
// memtable, we will update the region group pretending memory just went down by that amount.
//
// Because the amount of memory that we pretend to free should be close enough to the actual
// memory used by the memtables, that effectively creates two sub-regions inside the dirty
// region group, of equal size. In the worst case, we will have <memtable_total_space> dirty
// bytes used, and half of that already virtually freed.
//
// Hard Limit
// ----------
// The total space that can be used by memtables in each group is defined by the threshold, but
// we will only allow the region_group to grow to half of that. This is because of virtual_dirty
// as explained above. Because virtual dirty is implemented by reducing the usage in the
// region_group directly as each partition is written, we want to throttle once half of the
// memory, as seen by the region_group, is in use. To achieve that we need to set the hard
// limit (first parameter of the region_group_reclaimer) to 1/2 of the user-supplied threshold.
//
// Soft Limit
// ----------
// When the soft limit is hit, no throttle happens. The soft limit exists because we don't want
// to start flushing only when the limit is hit, but a bit earlier instead. If we were to start
// flushing only when the hard limit is hit, workloads in which the disk is fast enough to cope
// would see latency added to some requests unnecessarily.
//
// We then set the soft limit to 80% of the virtual dirty hard limit, which is equal to 40% of
// the user-supplied threshold.
dirty_memory_manager(database& db, size_t threshold)
: logalloc::region_group_reclaimer(threshold / 2, threshold * 0.40)
, _db(&db)
, _region_group(*this)
, _flush_serializer(1)
, _waiting_flush(flush_when_needed()) {}
dirty_memory_manager() : logalloc::region_group_reclaimer()
, _db(nullptr)
, _region_group(*this)
, _flush_serializer(1)
, _waiting_flush(make_ready_future<>()) {}
static dirty_memory_manager& from_region_group(logalloc::region_group *rg) {
return *(boost::intrusive::get_parent_from_member(rg, &dirty_memory_manager::_region_group));
}
dirty_memory_manager(database* db, dirty_memory_manager *parent, size_t threshold, size_t concurrency)
: logalloc::region_group_reclaimer(threshold)
, _db(db)
, _region_group(&parent->_region_group, *this)
, _concurrency(concurrency)
, _flush_serializer(concurrency) {}
logalloc::region_group& region_group() {
return _region_group;
}
@@ -164,33 +213,52 @@ public:
return _region_group;
}
template <typename Func>
future<> serialize_flush(Func&& func) {
return seastar::with_gate(_waiting_flush_gate, [this, func] () mutable {
return with_semaphore(_flush_serializer, 1, func).finally([this] {
maybe_do_active_flush();
});
});
void revert_potentially_cleaned_up_memory(logalloc::region* from, int64_t delta) {
_region_group.update(delta);
_dirty_bytes_released_pre_accounted -= delta;
}
void account_potentially_cleaned_up_memory(logalloc::region* from, int64_t delta) {
_region_group.update(-delta);
_dirty_bytes_released_pre_accounted += delta;
}
// This can be called multiple times during the lifetime of the region, and should always
// ultimately be called after the flush ends. However, some flushers may decide to call it
// earlier. For instance, the normal memtables sealing function will call this before updating
// the cache.
//
// Also, for sealing methods that may retry after a failed write (like the normal memtable
// sealing method), calling this method after the attempt completes, whether with success or failure, is
// mandatory. That's because the new attempt will create a new flush reader for the same
// SSTable, so we need to make sure that we revert the old charges.
void remove_from_flush_manager(const logalloc::region *region) {
auto it = _flush_manager.find(region);
if (it != _flush_manager.end()) {
_flush_manager.erase(it);
}
}
void add_to_flush_manager(const logalloc::region *region, flush_permit&& permit) {
_flush_manager.emplace(region, std::move(permit));
}
size_t real_dirty_memory() const {
return _region_group.memory_used() + _dirty_bytes_released_pre_accounted;
}
size_t virtual_dirty_memory() const {
return _region_group.memory_used();
}
future<> flush_one(memtable_list& cf, semaphore_units<> permit);
future<semaphore_units<>> get_flush_permit() {
return get_units(_flush_serializer, 1);
}
};
class streaming_dirty_memory_manager: public dirty_memory_manager {
virtual memtable_list& get_memtable_list(column_family& cf) override;
public:
streaming_dirty_memory_manager(database& db, dirty_memory_manager *parent, size_t threshold) : dirty_memory_manager(&db, parent, threshold, 2) {}
};
class memtable_dirty_memory_manager: public dirty_memory_manager {
virtual memtable_list& get_memtable_list(column_family& cf) override;
public:
memtable_dirty_memory_manager(database& db, dirty_memory_manager* parent, size_t threshold) : dirty_memory_manager(&db, parent, threshold, 4) {}
// This constructor will be called for the system tables (no parent). Its flushes are usually driven by us
// and not the user, and tend to be small in size. So we'll allow only two slots.
memtable_dirty_memory_manager(database& db, size_t threshold) : dirty_memory_manager(&db, threshold, 2) {}
memtable_dirty_memory_manager() : dirty_memory_manager(nullptr, std::numeric_limits<size_t>::max(), 4) {}
};
extern thread_local memtable_dirty_memory_manager default_dirty_memory_manager;
extern thread_local dirty_memory_manager default_dirty_memory_manager;
// We could just add all memtables, regardless of types, to a single list, and
// then filter them out when we read them. Here's why I have chosen not to do
@@ -217,18 +285,29 @@ private:
std::vector<shared_memtable> _memtables;
std::function<future<> (flush_behavior)> _seal_fn;
std::function<schema_ptr()> _current_schema;
size_t _max_memtable_size;
dirty_memory_manager* _dirty_memory_manager;
std::experimental::optional<shared_promise<>> _flush_coalescing;
public:
memtable_list(std::function<future<> (flush_behavior)> seal_fn, std::function<schema_ptr()> cs, size_t max_memtable_size, dirty_memory_manager* dirty_memory_manager)
memtable_list(std::function<future<> (flush_behavior)> seal_fn, std::function<schema_ptr()> cs, dirty_memory_manager* dirty_memory_manager)
: _memtables({})
, _seal_fn(seal_fn)
, _current_schema(cs)
, _max_memtable_size(max_memtable_size)
, _dirty_memory_manager(dirty_memory_manager) {
add_memtable();
}
memtable_list(std::function<schema_ptr()> cs, dirty_memory_manager* dirty_memory_manager)
: _memtables({})
, _seal_fn()
, _current_schema(cs)
, _dirty_memory_manager(dirty_memory_manager) {
add_memtable();
}
bool may_flush() const {
return bool(_seal_fn);
}
shared_memtable back() {
return _memtables.back();
}
@@ -273,21 +352,16 @@ public:
_memtables.emplace_back(new_memtable());
}
bool should_flush() {
return active_memtable().occupancy().total_space() >= _max_memtable_size;
}
void seal_on_overflow() {
if (should_flush()) {
// FIXME: if sparse, do some in-memory compaction first
// FIXME: maybe merge with other in-memory memtables
seal_active_memtable(flush_behavior::immediate);
}
logalloc::region_group& region_group() {
return _dirty_memory_manager->region_group();
}
// This is used for explicit flushes. Will queue the memtable for flushing and proceed when the
// dirty_memory_manager allows us to. We will not seal at this time since the flush itself
// wouldn't happen anyway. Keeping the memtable in memory will potentially increase the time it
// spends in memory allowing for more coalescing opportunities.
future<> request_flush();
private:
lw_shared_ptr<memtable> new_memtable() {
return make_lw_shared<memtable>(_current_schema(), &(_dirty_memory_manager->region_group()));
}
lw_shared_ptr<memtable> new_memtable();
};
using sstable_list = sstables::sstable_list;
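The `request_flush()` path above coalesces concurrent flush requests through `_flush_coalescing` (a shared promise): the first caller registers a pending flush, later callers attach to the same one, and the memtable stays in memory longer to gain coalescing opportunities. A minimal single-threaded sketch of that coalescing idea, with a hypothetical `do_flush` callback standing in for the real flush machinery:

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Coalesces overlapping flush requests: while a flush is pending, further
// requests attach to the same pending operation instead of starting a new one.
class flush_coalescer {
    std::function<void()> _do_flush;   // the actual flush work (hypothetical)
    std::shared_ptr<int> _pending;     // stands in for the shared promise
    int _flushes_started = 0;
public:
    explicit flush_coalescer(std::function<void()> f) : _do_flush(std::move(f)) {}

    // Request a flush. Every caller coalesced into the same flush gets
    // the same token back.
    std::shared_ptr<int> request_flush() {
        if (!_pending) {
            _pending = std::make_shared<int>(++_flushes_started);
        }
        return _pending;
    }

    // Called when the manager decides the pending flush may proceed.
    void run_pending() {
        if (_pending) {
            _do_flush();
            _pending.reset();          // the next request starts a new flush
        }
    }

    int flushes_started() const { return _flushes_started; }
};
```

Usage: three `request_flush()` calls before `run_pending()` result in a single flush; a request made afterwards starts a second one.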
@@ -320,11 +394,10 @@ public:
bool enable_cache = true;
bool enable_commitlog = true;
bool enable_incremental_backups = false;
size_t max_memtable_size = 5'000'000;
size_t max_streaming_memtable_size = 5'000'000;
::dirty_memory_manager* dirty_memory_manager = &default_dirty_memory_manager;
::dirty_memory_manager* streaming_dirty_memory_manager = &default_dirty_memory_manager;
restricted_mutation_reader_config read_concurrency_config;
restricted_mutation_reader_config streaming_read_concurrency_config;
::cf_stats* cf_stats = nullptr;
uint64_t max_cached_partition_size_in_bytes;
};
@@ -379,9 +452,6 @@ private:
lw_shared_ptr<memtable_list> _streaming_memtables;
utils::phased_barrier _streaming_flush_phaser;
friend class memtable_dirty_memory_manager;
friend class streaming_dirty_memory_manager;
// If mutations are fragmented during streaming the sstables cannot be made
// visible immediately after memtable flush, because that could cause
// readers to see only a part of a partition thus violating isolation
@@ -444,7 +514,10 @@ private:
void update_stats_for_new_sstable(uint64_t disk_space_used_by_sstable);
void add_sstable(sstables::sstable&& sstable);
void add_sstable(lw_shared_ptr<sstables::sstable> sstable);
future<> load_sstable(sstables::sstable&& sstab, bool reset_level = false);
// returns an empty pointer if sstable doesn't belong to current shard.
future<lw_shared_ptr<sstables::sstable>> open_sstable(sstring dir, int64_t generation,
sstables::sstable::version_types v, sstables::sstable::format_types f);
void load_sstable(lw_shared_ptr<sstables::sstable>& sstable, bool reset_level = false);
lw_shared_ptr<memtable> new_memtable();
lw_shared_ptr<memtable> new_streaming_memtable();
future<stop_iteration> try_flush_memtable_to_sstable(lw_shared_ptr<memtable> memt);
@@ -471,6 +544,8 @@ private:
const std::vector<sstables::shared_sstable>& sstables_to_remove);
void rebuild_statistics();
private:
using virtual_reader_type = std::function<mutation_reader(schema_ptr, const query::partition_range&, const query::partition_slice&, const io_priority_class&, tracing::trace_state_ptr)>;
virtual_reader_type _virtual_reader;
// Creates a mutation reader which covers sstables.
// Caller needs to ensure that column_family remains live (FIXME: relax this).
// The 'range' parameter must be live as long as the reader is used.
@@ -482,7 +557,6 @@ private:
tracing::trace_state_ptr trace_state) const;
mutation_source sstables_as_mutation_source();
key_source sstables_as_key_source() const;
partition_presence_checker make_partition_presence_checker(sstables::shared_sstable exclude_sstable);
std::chrono::steady_clock::time_point _sstable_writes_disabled_at;
void do_trigger_compaction();
@@ -528,8 +602,16 @@ public:
mutation_reader make_streaming_reader(schema_ptr schema,
const query::partition_range& range = query::full_partition_range) const;
// Requires ranges to be sorted and disjoint.
mutation_reader make_streaming_reader(schema_ptr schema,
const std::vector<query::partition_range>& ranges) const;
mutation_source as_mutation_source(tracing::trace_state_ptr trace_state) const;
void set_virtual_reader(virtual_reader_type virtual_reader) {
_virtual_reader = std::move(virtual_reader);
}
// Queries can be satisfied from multiple data sources, so they are returned
// as temporaries.
//
@@ -571,7 +653,9 @@ public:
future<lw_shared_ptr<query::result>> query(schema_ptr,
const query::read_command& cmd, query::result_request request,
const std::vector<query::partition_range>& ranges,
tracing::trace_state_ptr trace_state);
tracing::trace_state_ptr trace_state,
query::result_memory_limiter& memory_limiter,
uint64_t max_result_size);
future<> populate(sstring datadir);
@@ -669,8 +753,10 @@ public:
_config.enable_incremental_backups = val;
}
const sstables::sstable_set& get_sstable_set() const;
lw_shared_ptr<sstable_list> get_sstables() const;
lw_shared_ptr<sstable_list> get_sstables_including_compacted_undeleted() const;
const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const;
std::vector<sstables::shared_sstable> select_sstables(const query::partition_range& range) const;
size_t sstables_count() const;
std::vector<uint64_t> sstable_count_per_level() const;
@@ -743,7 +829,7 @@ private:
// repair can now choose whatever strategy - small or big ranges - it wants, resting assured
// that the incoming memtables will be coalesced together.
shared_promise<> _waiting_streaming_flushes;
timer<> _delayed_streaming_flush{[this] { seal_active_streaming_memtable_immediate(); }};
timer<> _delayed_streaming_flush{[this] { _streaming_memtables->request_flush(); }};
future<> seal_active_streaming_memtable_delayed();
future<> seal_active_streaming_memtable_immediate();
future<> seal_active_streaming_memtable(memtable_list::flush_behavior behavior) {
@@ -874,11 +960,10 @@ public:
bool enable_disk_writes = true;
bool enable_cache = true;
bool enable_incremental_backups = false;
size_t max_memtable_size = 5'000'000;
size_t max_streaming_memtable_size = 5'000'000;
::dirty_memory_manager* dirty_memory_manager = &default_dirty_memory_manager;
::dirty_memory_manager* streaming_dirty_memory_manager = &default_dirty_memory_manager;
restricted_mutation_reader_config read_concurrency_config;
restricted_mutation_reader_config streaming_read_concurrency_config;
::cf_stats* cf_stats = nullptr;
};
private:
@@ -957,23 +1042,32 @@ public:
// use shard_of() for data
class database {
public:
using timeout_clock = std::chrono::steady_clock;
private:
::cf_stats _cf_stats;
static constexpr size_t max_concurrent_reads() { return 100; }
static constexpr size_t max_system_concurrent_reads() { return 10; }
struct db_stats {
uint64_t total_writes = 0;
uint64_t total_writes_failed = 0;
uint64_t total_writes_timedout = 0;
uint64_t total_reads = 0;
uint64_t total_reads_failed = 0;
uint64_t sstable_read_queue_overloaded = 0;
uint64_t short_data_queries = 0;
uint64_t short_mutation_queries = 0;
};
lw_shared_ptr<db_stats> _stats;
std::unique_ptr<db::config> _cfg;
size_t _memtable_total_space = 500 << 20;
size_t _streaming_memtable_total_space = 500 << 20;
memtable_dirty_memory_manager _system_dirty_memory_manager;
memtable_dirty_memory_manager _dirty_memory_manager;
streaming_dirty_memory_manager _streaming_dirty_memory_manager;
dirty_memory_manager _system_dirty_memory_manager;
dirty_memory_manager _dirty_memory_manager;
dirty_memory_manager _streaming_dirty_memory_manager;
semaphore _read_concurrency_sem{max_concurrent_reads()};
restricted_mutation_reader_config _read_concurrency_config;
semaphore _system_read_concurrency_sem{max_system_concurrent_reads()};
@@ -990,7 +1084,7 @@ class database {
bool _enable_incremental_backups = false;
future<> init_commitlog();
future<> apply_in_memory(const frozen_mutation& m, schema_ptr m_schema, db::replay_position);
future<> apply_in_memory(const frozen_mutation& m, schema_ptr m_schema, db::replay_position, timeout_clock::time_point timeout);
future<> populate(sstring datadir);
future<> populate_keyspace(sstring datadir, sstring ks_name);
@@ -1002,10 +1096,16 @@ private:
friend void db::system_keyspace::make(database& db, bool durable, bool volatile_testing_only);
void setup_collectd();
future<> do_apply(schema_ptr, const frozen_mutation&);
future<> do_apply(schema_ptr, const frozen_mutation&, timeout_clock::time_point timeout);
query::result_memory_limiter _result_memory_limiter;
public:
static utils::UUID empty_version;
query::result_memory_limiter& get_result_memory_limiter() {
return _result_memory_limiter;
}
void set_enable_incremental_backups(bool val) { _enable_incremental_backups = val; }
future<> parse_system_tables(distributed<service::storage_proxy>&);
@@ -1067,9 +1167,13 @@ public:
unsigned shard_of(const dht::token& t);
unsigned shard_of(const mutation& m);
unsigned shard_of(const frozen_mutation& m);
future<lw_shared_ptr<query::result>> query(schema_ptr, const query::read_command& cmd, query::result_request request, const std::vector<query::partition_range>& ranges, tracing::trace_state_ptr trace_state);
future<reconcilable_result> query_mutations(schema_ptr, const query::read_command& cmd, const query::partition_range& range, tracing::trace_state_ptr trace_state);
future<> apply(schema_ptr, const frozen_mutation&);
future<lw_shared_ptr<query::result>> query(schema_ptr, const query::read_command& cmd, query::result_request request, const std::vector<query::partition_range>& ranges,
tracing::trace_state_ptr trace_state, uint64_t max_result_size);
future<reconcilable_result> query_mutations(schema_ptr, const query::read_command& cmd, const query::partition_range& range,
query::result_memory_accounter&& accounter, tracing::trace_state_ptr trace_state);
// Apply the mutation atomically.
// Throws timed_out_error when timeout is reached.
future<> apply(schema_ptr, const frozen_mutation&, timeout_clock::time_point timeout = timeout_clock::time_point::max());
future<> apply_streaming_mutation(schema_ptr, utils::UUID plan_id, const frozen_mutation&, bool fragmented);
keyspace::config make_keyspace_config(const keyspace_metadata& ksm);
const sstring& get_snitch_name() const;


@@ -23,6 +23,7 @@
// database.hh
class database;
class memtable_list;
// mutation.hh
class mutation;


@@ -66,6 +66,7 @@
#include "serialization_visitors.hh"
#include "idl/uuid.dist.impl.hh"
#include "idl/frozen_schema.dist.impl.hh"
#include "message/messaging_service.hh"
static logging::logger logger("batchlog_manager");


@@ -46,6 +46,7 @@
#include <boost/range/adaptor/map.hpp>
#include <unordered_map>
#include <unordered_set>
#include <exception>
#include <core/align.hh>
#include <core/reactor.hh>
@@ -56,6 +57,9 @@
#include <core/gate.hh>
#include <core/fstream.hh>
#include <seastar/core/memory.hh>
#include <seastar/core/chunked_fifo.hh>
#include <seastar/core/queue.hh>
#include <seastar/core/sleep.hh>
#include <net/byteorder.hh>
#include "commitlog.hh"
@@ -76,8 +80,10 @@
static logging::logger logger("commitlog");
using namespace std::chrono_literals;
class crc32_nbo {
crc32 _c;
utils::crc32 _c;
public:
template <typename T>
void process(T t) {
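The `crc32_nbo` wrapper above checksums multi-byte values in network byte order, so the on-disk commitlog checksum does not depend on the host's endianness. A minimal sketch of the idea using a bitwise CRC-32 (IEEE reflected polynomial; an illustration, not the actual utils::crc32 implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
struct crc32 {
    uint32_t state = 0xFFFFFFFFu;
    void process_bytes(const uint8_t* p, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            state ^= p[i];
            for (int k = 0; k < 8; ++k) {
                state = (state >> 1) ^ (0xEDB88320u & (0u - (state & 1u)));
            }
        }
    }
    uint32_t checksum() const { return ~state; }
};

// Feeds integral values most-significant byte first (network byte order),
// so the resulting checksum is identical on little- and big-endian hosts.
struct crc32_nbo {
    crc32 _c;
    template <typename T>
    void process(T t) {
        uint8_t buf[sizeof(T)];
        for (size_t i = 0; i < sizeof(T); ++i) {
            buf[i] = uint8_t(uint64_t(t) >> (8 * (sizeof(T) - 1 - i)));
        }
        _c.process_bytes(buf, sizeof(T));
    }
    uint32_t checksum() const { return _c.checksum(); }
};
```

Feeding `uint32_t(0x31323334)` through `crc32_nbo` yields the same checksum as feeding the bytes `0x31 0x32 0x33 0x34` directly, regardless of host byte order.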
@@ -155,6 +161,7 @@ const std::string db::commitlog::descriptor::FILENAME_EXTENSION(".log");
class db::commitlog::segment_manager : public ::enable_shared_from_this<segment_manager> {
public:
config cfg;
std::vector<sstring> _segments_to_replay;
const uint64_t max_size;
const uint64_t max_mutation_size;
// Divide the size-on-disk threshold by #cpus used, since we assume
@@ -162,10 +169,12 @@ public:
const uint64_t max_disk_size; // per-shard
bool _shutdown = false;
std::experimental::optional<shared_promise<>> _shutdown_promise = {};
semaphore _new_segment_semaphore {1};
semaphore _write_semaphore;
semaphore _flush_semaphore;
// Allocation must throw timed_out_error by contract.
using timeout_exception_factory = default_timeout_exception_factory;
basic_semaphore<timeout_exception_factory> _flush_semaphore;
scollectd::registrations _regs;
@@ -174,6 +183,23 @@ public:
using time_point = clock_type::time_point;
using sseg_ptr = lw_shared_ptr<segment>;
using request_controller_type = basic_semaphore<timeout_exception_factory>;
using request_controller_units = semaphore_units<timeout_exception_factory>;
request_controller_type _request_controller;
stdx::optional<shared_future<with_clock<commitlog::timeout_clock>>> _segment_allocating;
void account_memory_usage(size_t size) {
_request_controller.consume(size);
}
void notify_memory_written(size_t size) {
_request_controller.signal(size);
}
future<db::replay_position>
allocate_when_possible(const cf_id_type& id, shared_ptr<entry_writer> writer, commitlog::timeout_clock::time_point timeout);
struct stats {
uint64_t cycle_count = 0;
uint64_t flush_count = 0;
@@ -182,29 +208,18 @@ public:
uint64_t bytes_slack = 0;
uint64_t segments_created = 0;
uint64_t segments_destroyed = 0;
uint64_t pending_writes = 0;
uint64_t pending_flushes = 0;
uint64_t pending_allocations = 0;
uint64_t write_limit_exceeded = 0;
uint64_t flush_limit_exceeded = 0;
uint64_t total_size = 0;
uint64_t buffer_list_bytes = 0;
uint64_t total_size_on_disk = 0;
uint64_t requests_blocked_memory = 0;
};
stats totals;
future<> begin_write() {
++totals.pending_writes; // redundant, given semaphore. but easier to read
if (totals.pending_writes >= cfg.max_active_writes) {
++totals.write_limit_exceeded;
logger.trace("Write ops overflow: {}. Will block.", totals.pending_writes);
}
return _write_semaphore.wait();
}
void end_write() {
_write_semaphore.signal();
--totals.pending_writes;
size_t pending_allocations() const {
return _request_controller.waiters();
}
future<> begin_flush() {
@@ -219,46 +234,7 @@ public:
_flush_semaphore.signal();
--totals.pending_flushes;
}
bool should_wait_for_write() const {
return cfg.mode == sync_mode::BATCH || _write_semaphore.waiters() > 0 || _flush_semaphore.waiters() > 0;
}
segment_manager(config c)
: cfg([&c] {
config cfg(c);
if (cfg.commit_log_location.empty()) {
cfg.commit_log_location = "/var/lib/scylla/commitlog";
}
if (cfg.max_active_writes == 0) {
cfg.max_active_writes = // TODO: call someone to get an idea...
25 * smp::count;
}
cfg.max_active_writes = std::max(uint64_t(1), cfg.max_active_writes / smp::count);
if (cfg.max_active_flushes == 0) {
cfg.max_active_flushes = // TODO: call someone to get an idea...
5 * smp::count;
}
cfg.max_active_flushes = std::max(uint64_t(1), cfg.max_active_flushes / smp::count);
return cfg;
}())
, max_size(std::min<size_t>(std::numeric_limits<position_type>::max(), std::max<size_t>(cfg.commitlog_segment_size_in_mb, 1) * 1024 * 1024))
, max_mutation_size(max_size >> 1)
, max_disk_size(size_t(std::ceil(cfg.commitlog_total_space_in_mb / double(smp::count))) * 1024 * 1024)
, _write_semaphore(cfg.max_active_writes)
, _flush_semaphore(cfg.max_active_flushes)
{
assert(max_size > 0);
logger.trace("Commitlog {} maximum disk size: {} MB / cpu ({} cpus)",
cfg.commit_log_location, max_disk_size / (1024 * 1024),
smp::count);
_regs = create_counters();
}
segment_manager(config c);
~segment_manager() {
logger.trace("Commitlog {} disposed", cfg.commit_log_location);
}
@@ -267,9 +243,19 @@ public:
return ++_ids;
}
std::exception_ptr sanity_check_size(size_t size) {
if (size > max_mutation_size) {
return make_exception_ptr(std::invalid_argument(
"Mutation of " + std::to_string(size)
+ " bytes is too large for the maximum size of "
+ std::to_string(max_mutation_size)));
}
return nullptr;
}
future<> init();
future<sseg_ptr> new_segment();
future<sseg_ptr> active_segment();
future<sseg_ptr> active_segment(commitlog::timeout_clock::time_point timeout);
future<sseg_ptr> allocate_segment(bool active);
future<> clear();
@@ -278,7 +264,7 @@ public:
scollectd::registrations create_counters();
void orphan_all();
future<> orphan_all();
void discard_unused_segments();
void discard_completed_segments(const cf_id_type& id,
@@ -314,21 +300,19 @@ public:
void flush_segments(bool = false);
private:
future<> clear_reserve_segments();
size_t max_request_controller_units() const;
segment_id_type _ids = 0;
std::vector<sseg_ptr> _segments;
std::deque<sseg_ptr> _reserve_segments;
queue<sseg_ptr> _reserve_segments;
std::vector<buffer_type> _temp_buffers;
std::unordered_map<flush_handler_id, flush_handler> _flush_handlers;
flush_handler_id _flush_ids = 0;
replay_position _flush_position;
timer<clock_type> _timer;
size_t _reserve_allocating = 0;
// # segments to try to keep available in reserve
// i.e. the number of segments we expect to consume in between timer
// callbacks.
// The idea is that since the files are 0 len at start, and thus cost little,
// it is easier to adapt this value compared to timer freq.
size_t _num_reserve_segments = 0;
future<> replenish_reserve();
future<> _reserve_replenisher;
seastar::gate _gate;
uint64_t _new_counter = 0;
};
@@ -388,8 +372,6 @@ class db::commitlog::segment: public enable_lw_shared_from_this<segment> {
uint64_t _buf_pos = 0;
bool _closed = false;
size_t _needed_size = 0;
using buffer_type = segment_manager::buffer_type;
using sseg_ptr = segment_manager::sseg_ptr;
using clock_type = segment_manager::clock_type;
@@ -420,17 +402,6 @@ class db::commitlog::segment: public enable_lw_shared_from_this<segment> {
_segment_manager->end_flush();
}
future<> begin_write() {
// This maintains the semantics of only using the write-lock
// as a gate for flushing, i.e. once we've begun a flush for position X
// we are ok with writes to positions > X
return _segment_manager->begin_write();
}
void end_write() {
_segment_manager->end_write();
}
public:
struct cf_mark {
const segment& s;
@@ -500,11 +471,10 @@ public:
/**
* Finalize this segment and get a new one
*/
future<sseg_ptr> finish_and_get_new() {
future<sseg_ptr> finish_and_get_new(commitlog::timeout_clock::time_point timeout) {
_closed = true;
return maybe_wait_for_write(sync()).then([](sseg_ptr s) {
return s->_segment_manager->active_segment();
});
sync();
return _segment_manager->active_segment(timeout);
}
void reset_sync_time() {
_sync_time = clock_type::now();
@@ -589,9 +559,6 @@ public:
void new_buffer(size_t s) {
assert(_buffer.empty());
s += _needed_size;
_needed_size = 0;
auto overhead = segment_overhead_size;
if (_file_pos == 0) {
overhead += descriptor_header_size;
@@ -684,8 +651,6 @@ public:
// The write will be allowed to start now, but flush (below) must wait for not only this,
// but all previous write/flush pairs.
return _pending_ops.run_with_ordered_post_op(rp, [this, size, off, buf = std::move(buf)]() mutable {
// This could "block" if we have too many pending writes.
return begin_write().then([this, size, off, buf = std::move(buf)]() mutable {
auto written = make_lw_shared<size_t>(0);
auto p = buf.get();
return repeat([this, size, off, written, p]() mutable {
@@ -712,63 +677,17 @@ public:
throw;
}
});
}).finally([this, buf = std::move(buf)]() mutable {
}).finally([this, buf = std::move(buf), size]() mutable {
_segment_manager->release_buffer(std::move(buf));
_segment_manager->notify_memory_written(size);
});
}).finally([this]() {
end_write(); // release
});
}, [me, flush_after, top, rp] { // lambda instead of bind, so we keep "me" alive.
assert(me->_pending_ops.has_operation(rp));
return flush_after ? me->do_flush(top) : make_ready_future<sseg_ptr>(me);
});
}
future<sseg_ptr> maybe_wait_for_write(future<sseg_ptr> f) {
if (_segment_manager->should_wait_for_write()) {
++_write_waiters;
logger.trace("Too many pending writes. Must wait.");
return f.finally([this] {
--_write_waiters;
});
}
return make_ready_future<sseg_ptr>(shared_from_this());
}
/**
* If an allocation causes a write, and the write causes a block,
* any allocations after that point need to wait for it to finish;
* otherwise we would just keep building up the write queue
* (and lose more ordering)
*
* Some caution here, since maybe_wait_for_write actually
* releases _all_ queued up ops when finishing, we could get
* "bursts" of alloc->write, causing build-ups anyway.
* This should be measured properly. For now I am hoping this
* will work out as these should "block as a group". However,
* buffer memory usage might grow...
*/
bool must_wait_for_alloc() {
// Note: write_waiters is decremented _after_ both semaphores and
// flush queue might be cleared. So we should not look only at it.
// But we still don't want to look at "should_wait_for_write" directly,
// since that is "global" and includes other segments, and we want to
// know if _this_ segment has blocking write ops pending.
// So we also check that the flush queue is non-empty.
return _write_waiters > 0 && !_pending_ops.empty();
}
future<sseg_ptr> wait_for_alloc() {
auto me = shared_from_this();
++_segment_manager->totals.pending_allocations;
logger.trace("Previous allocation is blocking. Must wait.");
return _pending_ops.wait_for_pending().then_wrapped([me](auto f) { // TODO: do we need a finally?
--me->_segment_manager->totals.pending_allocations;
return f.failed() ? me->_segment_manager->active_segment() : make_ready_future<sseg_ptr>(me);
});
}
future<sseg_ptr> batch_cycle() {
future<sseg_ptr> batch_cycle(timeout_clock::time_point timeout) {
/**
* For batch mode we force a write "immediately".
* However, we first wait for all previous writes/flushes
@@ -779,7 +698,7 @@ public:
*/
auto me = shared_from_this();
auto fp = _file_pos;
return _pending_ops.wait_for_pending().then([me = std::move(me), fp] {
return _pending_ops.wait_for_pending(timeout).then([me = std::move(me), fp, timeout] {
if (fp != me->_file_pos) {
// some other request already wrote this buffer.
// If so, wait for the operation at our intended file offset
@@ -787,12 +706,14 @@ public:
// are in accord.
// (Note: wait_for_pending(pos) waits for operation _at_ pos (and before),
replay_position rp(me->_desc.id, position_type(fp));
return me->_pending_ops.wait_for_pending(rp).then([me, fp] {
return me->_pending_ops.wait_for_pending(rp, timeout).then([me, fp] {
assert(me->_flush_pos > fp);
return make_ready_future<sseg_ptr>(me);
});
}
return me->sync();
// It is ok to leave the sync behind on timeout because there will be at most one
// such sync; all later allocations will block on _pending_ops until it is done.
return with_timeout(timeout, me->sync());
}).handle_exception([me, fp](auto p) {
// If we get an IO exception (which we assume this is)
// we should close the segment.
@@ -802,51 +723,52 @@ public:
return make_exception_future<sseg_ptr>(p);
});
}
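`batch_cycle` above bounds its waits with `with_timeout`, failing the caller with a timeout error if the pending operations do not complete by the deadline while letting the underlying sync run to completion. A minimal sketch of that pattern over `std::future` (an assumed helper for illustration, not seastar's `with_timeout`):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>

struct timed_out_error : std::runtime_error {
    timed_out_error() : std::runtime_error("timed out") {}
};

// Waits for f up to the given deadline; throws timed_out_error on expiry.
// The underlying work keeps running - only the waiter gives up.
template <typename T>
T get_with_timeout(std::future<T>& f, std::chrono::steady_clock::time_point deadline) {
    if (f.wait_until(deadline) == std::future_status::timeout) {
        throw timed_out_error();
    }
    return f.get();
}
```

A future that is already resolved returns its value; one that never resolves raises `timed_out_error` once the deadline passes.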
/**
* Add a "mutation" to the segment.
*/
future<replay_position> allocate(const cf_id_type& id, shared_ptr<entry_writer> writer) {
const auto size = writer->size(*this);
const auto s = size + entry_overhead_size; // total size
if (s > _segment_manager->max_mutation_size) {
return make_exception_future<replay_position>(
std::invalid_argument(
"Mutation of " + std::to_string(s)
+ " bytes is too large for the maximum size of "
+ std::to_string(_segment_manager->max_mutation_size)));
future<replay_position> allocate(const cf_id_type& id, shared_ptr<entry_writer> writer, segment_manager::request_controller_units permit, commitlog::timeout_clock::time_point timeout) {
if (must_sync()) {
return with_timeout(timeout, sync()).then([this, id, writer = std::move(writer), permit = std::move(permit), timeout] (auto s) mutable {
return s->allocate(id, std::move(writer), std::move(permit), timeout);
});
}
std::experimental::optional<future<sseg_ptr>> op;
const auto size = writer->size(*this);
const auto s = size + entry_overhead_size; // total size
auto ep = _segment_manager->sanity_check_size(s);
if (ep) {
return make_exception_future<replay_position>(std::move(ep));
}
if (must_sync()) {
op = sync();
} else if (must_wait_for_alloc()) {
op = wait_for_alloc();
} else if (!is_still_allocating() || position() + s > _segment_manager->max_size) { // would we make the file too big?
// do this in next segment instead.
op = finish_and_get_new();
} else if (_buffer.empty()) {
new_buffer(s);
} else if (s > (_buffer.size() - _buf_pos)) { // enough data?
_needed_size += s; // hint to next new_buffer, in case we are not first.
if (!is_still_allocating() || position() + s > _segment_manager->max_size) { // would we make the file too big?
return finish_and_get_new(timeout).then([id, writer = std::move(writer), permit = std::move(permit), timeout] (auto new_seg) mutable {
return new_seg->allocate(id, std::move(writer), std::move(permit), timeout);
});
} else if (!_buffer.empty() && (s > (_buffer.size() - _buf_pos))) { // enough data?
if (_segment_manager->cfg.mode == sync_mode::BATCH) {
// TODO: this could cause starvation if we're really unlucky.
// If we run batch mode and find ourselves not fit in a non-empty
// buffer, we must force a cycle and wait for it (to keep flush order)
// This will most likely cause parallel writes, and consecutive flushes.
op = cycle(true);
return with_timeout(timeout, cycle(true)).then([this, id, writer = std::move(writer), permit = std::move(permit), timeout] (auto new_seg) mutable {
return new_seg->allocate(id, std::move(writer), std::move(permit), timeout);
});
} else {
op = maybe_wait_for_write(cycle());
cycle();
}
}
if (op) {
return op->then([id, writer = std::move(writer)] (sseg_ptr new_seg) mutable {
return new_seg->allocate(id, std::move(writer));
});
size_t buf_memory = s;
if (_buffer.empty()) {
new_buffer(s);
buf_memory += _buf_pos;
}
_gate.enter(); // this might throw. I guess we accept this?
buf_memory -= permit.release();
_segment_manager->account_memory_usage(buf_memory);
replay_position rp(_desc.id, position());
auto pos = _buf_pos;
@@ -877,12 +799,18 @@ public:
_gate.leave();
if (_segment_manager->cfg.mode == sync_mode::BATCH) {
return batch_cycle().then([rp](auto s) {
return batch_cycle(timeout).then([rp](auto s) {
return make_ready_future<replay_position>(rp);
});
} else {
// If this buffer alone is too big, potentially bigger than the maximum allowed size,
// then no other request will be allowed in to force the cycle()ing of this buffer. We
// have to do it ourselves.
if ((_buf_pos >= (db::commitlog::segment::default_size))) {
cycle();
}
return make_ready_future<replay_position>(rp);
}
return make_ready_future<replay_position>(rp);
}
position_type position() const {
@@ -900,6 +828,7 @@ public:
std::fill(_buffer.get_write() + _buf_pos, _buffer.get_write() + size,
0);
_segment_manager->totals.bytes_slack += (size - _buf_pos);
_segment_manager->account_memory_usage(size - _buf_pos);
return size;
}
void mark_clean(const cf_id_type& id, position_type pos) {
@@ -941,8 +870,98 @@ public:
}
};
future<db::replay_position>
db::commitlog::segment_manager::allocate_when_possible(const cf_id_type& id, shared_ptr<entry_writer> writer, commitlog::timeout_clock::time_point timeout) {
auto size = writer->size();
// If this is already too big, we should throw early. It is also a correctness issue: an
// oversized request would never be admitted, so it would never reach allocate() to throw
// there.
auto ep = sanity_check_size(size);
if (ep) {
return make_exception_future<replay_position>(std::move(ep));
}
auto fut = get_units(_request_controller, size, timeout);
if (_request_controller.waiters()) {
totals.requests_blocked_memory++;
}
return fut.then([this, id, writer = std::move(writer), timeout] (auto permit) mutable {
return this->active_segment(timeout).then([this, timeout, id, writer = std::move(writer), permit = std::move(permit)] (auto s) mutable {
return s->allocate(id, std::move(writer), std::move(permit), timeout);
});
});
}
const size_t db::commitlog::segment::default_size;
db::commitlog::segment_manager::segment_manager(config c)
: cfg([&c] {
config cfg(c);
if (cfg.commit_log_location.empty()) {
cfg.commit_log_location = "/var/lib/scylla/commitlog";
}
if (cfg.max_active_writes == 0) {
cfg.max_active_writes = // TODO: call someone to get an idea...
25 * smp::count;
}
cfg.max_active_writes = std::max(uint64_t(1), cfg.max_active_writes / smp::count);
if (cfg.max_active_flushes == 0) {
cfg.max_active_flushes = // TODO: call someone to get an idea...
5 * smp::count;
}
cfg.max_active_flushes = std::max(uint64_t(1), cfg.max_active_flushes / smp::count);
return cfg;
}())
, max_size(std::min<size_t>(std::numeric_limits<position_type>::max(), std::max<size_t>(cfg.commitlog_segment_size_in_mb, 1) * 1024 * 1024))
, max_mutation_size(max_size >> 1)
, max_disk_size(size_t(std::ceil(cfg.commitlog_total_space_in_mb / double(smp::count))) * 1024 * 1024)
, _flush_semaphore(cfg.max_active_flushes)
// That is enough concurrency to allow for our largest mutation (max_mutation_size), plus
// an existing in-flight buffer. Since we'll force the cycling() of any buffer that is bigger
// than default_size at the end of the allocation, that allows for every valid mutation to
// always be admitted for processing.
, _request_controller(max_request_controller_units())
, _reserve_segments(1)
, _reserve_replenisher(make_ready_future<>())
{
assert(max_size > 0);
logger.trace("Commitlog {} maximum disk size: {} MB / cpu ({} cpus)",
cfg.commit_log_location, max_disk_size / (1024 * 1024),
smp::count);
_regs = create_counters();
}
size_t db::commitlog::segment_manager::max_request_controller_units() const {
return max_mutation_size + db::commitlog::segment::default_size;
}
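`max_request_controller_units()` sizes the admission semaphore as the largest legal mutation plus one default-size in-flight buffer; since any buffer that grows past `default_size` is force-cycled at the end of allocation, even the largest valid mutation can always eventually be admitted. A single-threaded sketch of units-based admission (a simplified synchronous model; in the real code a failed acquisition queues the waiter with a timeout rather than returning false):

```cpp
#include <cassert>
#include <cstddef>

// Byte-granularity admission control: writers acquire units up front and
// return them once the corresponding bytes have been written out.
class request_controller {
    size_t _available;
public:
    explicit request_controller(size_t units) : _available(units) {}

    // Try to admit a request of n bytes.
    bool try_get_units(size_t n) {
        if (n > _available) {
            return false;
        }
        _available -= n;
        return true;
    }

    void signal(size_t n) { _available += n; }  // memory written back out
    size_t available() const { return _available; }
};
```

Sized as `max_mutation_size + default_size`, the controller can admit the largest mutation even while one default-size buffer is still in flight; further requests then block until memory is returned.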
future<> db::commitlog::segment_manager::replenish_reserve() {
return do_until([this] { return _shutdown; }, [this] {
return _reserve_segments.not_full().then([this] {
if (_shutdown) {
return make_ready_future<>();
}
return with_gate(_gate, [this] {
return this->allocate_segment(false).then([this](sseg_ptr s) {
auto ret = _reserve_segments.push(std::move(s));
if (!ret) {
logger.error("Segment reserve is full! Ignoring and trying to continue, but this shouldn't happen");
}
return make_ready_future<>();
});
}).handle_exception([](std::exception_ptr ep) {
logger.warn("Exception in segment reservation: {}", ep);
return sleep(100ms);
});
});
});
}
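`replenish_reserve()` runs in the background, topping up a bounded queue of pre-allocated segments so the hot allocation path can usually pop a ready segment instead of creating a file. A single-threaded sketch of the refill loop over a bounded reserve (hypothetical `make` factory; the real code runs this as an asynchronous fiber gated on `not_full()`):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

// Bounded reserve of pre-made resources; refill() tops it up to capacity.
template <typename T>
class reserve {
    std::deque<T> _q;
    size_t _max;
    std::function<T()> _make;
public:
    reserve(size_t max, std::function<T()> make)
        : _max(max), _make(std::move(make)) {}

    // The replenisher loop body: pre-allocate until the reserve is full.
    void refill() {
        while (_q.size() < _max) {
            _q.push_back(_make());
        }
    }

    // The hot path: pop a ready resource without paying allocation cost.
    T take() {
        T v = _q.front();
        _q.pop_front();
        return v;
    }

    size_t size() const { return _q.size(); }
    void grow(size_t new_max) { _max = new_max; }  // cf. adaptive reserve sizing
};
```

`grow()` mirrors the adaptive sizing in `active_segment()`, where the reserve capacity is bumped whenever the reserve is found empty, up to `cfg.max_reserve_segments`.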
future<std::vector<db::commitlog::descriptor>>
db::commitlog::segment_manager::list_descriptors(sstring dirname) {
struct helper {
@@ -992,7 +1011,7 @@ db::commitlog::segment_manager::list_descriptors(sstring dirname) {
}
};
return open_checked_directory(commit_error, dirname).then([this, dirname](file dir) {
return open_checked_directory(commit_error_handler, dirname).then([this, dirname](file dir) {
auto h = make_lw_shared<helper>(std::move(dirname), std::move(dir));
return h->done().then([h]() {
return make_ready_future<std::vector<db::commitlog::descriptor>>(std::move(h->_result));
@@ -1002,9 +1021,11 @@ db::commitlog::segment_manager::list_descriptors(sstring dirname) {
future<> db::commitlog::segment_manager::init() {
return list_descriptors(cfg.commit_log_location).then([this](std::vector<descriptor> descs) {
assert(_reserve_segments.empty()); // _segments_to_replay must not pick them up
segment_id_type id = std::chrono::duration_cast<std::chrono::milliseconds>(runtime::get_boot_time().time_since_epoch()).count() + 1;
for (auto& d : descs) {
id = std::max(id, replay_position(d.id).base_id());
_segments_to_replay.push_back(cfg.commit_log_location + "/" + d.filename());
}
// base id counter is [ <shard> | <base> ]
@@ -1013,6 +1034,9 @@ future<> db::commitlog::segment_manager::init() {
_timer.set_callback(std::bind(&segment_manager::on_timer, this));
auto delay = engine().cpu_id() * std::ceil(double(cfg.commitlog_sync_period_in_ms) / smp::count);
logger.trace("Delaying timer loop {} ms", delay);
// We need to wait until we have scanned all other segments before we actually start
// serving new segments. We are ready now.
this->_reserve_replenisher = replenish_reserve();
this->arm(delay);
});
}
@@ -1070,19 +1094,21 @@ scollectd::registrations db::commitlog::segment_manager::create_counters() {
, make_typed(data_type::DERIVE, totals.bytes_slack)
),
add_polled_metric(type_instance_id("commitlog"
, per_cpu_plugin_instance, "queue_length", "pending_writes")
, make_typed(data_type::GAUGE, totals.pending_writes)
),
add_polled_metric(type_instance_id("commitlog"
, per_cpu_plugin_instance, "queue_length", "pending_flushes")
, make_typed(data_type::GAUGE, totals.pending_flushes)
),
add_polled_metric(type_instance_id("commitlog"
, per_cpu_plugin_instance, "total_operations", "write_limit_exceeded")
, make_typed(data_type::DERIVE, totals.write_limit_exceeded)
, per_cpu_plugin_instance, "queue_length", "pending_allocations")
, make_typed(data_type::GAUGE, [this] { return pending_allocations(); })
),
add_polled_metric(type_instance_id("commitlog"
, per_cpu_plugin_instance, "total_operations", "requests_blocked_memory")
, make_typed(data_type::DERIVE, totals.requests_blocked_memory)
),
add_polled_metric(type_instance_id("commitlog"
, per_cpu_plugin_instance, "total_operations", "flush_limit_exceeded")
, make_typed(data_type::DERIVE, totals.flush_limit_exceeded)
@@ -1142,7 +1168,7 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
descriptor d(next_id());
file_open_options opt;
opt.extent_allocation_size_hint = max_size;
return open_checked_file_dma(commit_error, cfg.commit_log_location + "/" + d.filename(), open_flags::wo | open_flags::create, opt).then([this, d, active](file f) {
return open_checked_file_dma(commit_error_handler, cfg.commit_log_location + "/" + d.filename(), open_flags::wo | open_flags::create, opt).then([this, d, active](file f) {
// xfs doesn't like files extended beyond eof, so enlarge the file
return f.truncate(max_size).then([this, d, active, f] () mutable {
auto s = make_lw_shared<segment>(this->shared_from_this(), d, std::move(f), active);
@@ -1158,36 +1184,42 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
++_new_counter;
if (_reserve_segments.empty()) {
if (_num_reserve_segments < cfg.max_reserve_segments) {
++_num_reserve_segments;
logger.trace("Increased segment reserve count to {}", _num_reserve_segments);
}
return allocate_segment(true).then([this](sseg_ptr s) {
_segments.push_back(s);
return make_ready_future<sseg_ptr>(s);
});
if (_reserve_segments.empty() && (_reserve_segments.max_size() < cfg.max_reserve_segments)) {
_reserve_segments.set_max_size(_reserve_segments.max_size() + 1);
logger.debug("Increased segment reserve count to {}", _reserve_segments.max_size());
}
_segments.push_back(_reserve_segments.front());
_reserve_segments.pop_front();
_segments.back()->reset_sync_time();
logger.trace("Acquired segment {} from reserve", _segments.back());
return make_ready_future<sseg_ptr>(_segments.back());
return _reserve_segments.pop_eventually().then([this] (auto s) {
_segments.push_back(std::move(s));
_segments.back()->reset_sync_time();
return make_ready_future<sseg_ptr>(_segments.back());
});
}
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::active_segment() {
if (_segments.empty() || !_segments.back()->is_still_allocating()) {
return _new_segment_semaphore.wait().then([this]() {
if (_segments.empty() || !_segments.back()->is_still_allocating()) {
return new_segment();
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::active_segment(commitlog::timeout_clock::time_point timeout) {
// If there is no active segment, try to allocate one using new_segment(). If we time out,
// make sure later invocations can still pick that segment up once it's ready.
return repeat_until_value([this, timeout] () -> future<stdx::optional<sseg_ptr>> {
if (!_segments.empty() && _segments.back()->is_still_allocating()) {
return make_ready_future<stdx::optional<sseg_ptr>>(_segments.back());
}
return [this, timeout] {
if (!_segment_allocating) {
promise<> p;
_segment_allocating.emplace(p.get_future());
auto f = _segment_allocating->get_future(timeout);
with_gate(_gate, [this] {
return new_segment().discard_result().finally([this]() {
_segment_allocating = stdx::nullopt;
});
}).forward_to(std::move(p));
return f;
} else {
return _segment_allocating->get_future(timeout);
}
return make_ready_future<sseg_ptr>(_segments.back());
}).finally([this]() {
_new_segment_semaphore.signal();
}().then([] () -> stdx::optional<sseg_ptr> {
return stdx::nullopt;
});
}
return make_ready_future<sseg_ptr>(_segments.back());
});
}
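The reworked `active_segment()` above funnels every concurrent caller onto a single in-flight `new_segment()` call via a shared promise, and a caller that times out simply stops waiting while the allocation keeps running for the next caller. A minimal sketch of that pattern in plain C++ (using `std::shared_future` in place of Seastar's shared promise; `shared_allocator` and the `int` result are illustrative stand-ins, not Scylla names):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <thread>

// First caller kicks off the (slow) allocation; later callers attach to the
// same shared_future. A waiter that times out throws, but the allocation
// itself is not cancelled, so a subsequent call can still pick it up.
class shared_allocator {
    std::mutex _m;
    std::optional<std::shared_future<int>> _inflight; // int stands in for sseg_ptr
public:
    int get(std::chrono::milliseconds timeout) {
        std::shared_future<int> f;
        {
            std::lock_guard<std::mutex> g(_m);
            if (!_inflight) {
                _inflight = std::async(std::launch::async, [] {
                    // simulate a slow segment allocation (file open + truncate)
                    std::this_thread::sleep_for(std::chrono::milliseconds(10));
                    return 42; // the newly "allocated" segment
                }).share();
            }
            f = *_inflight;
        }
        if (f.wait_for(timeout) == std::future_status::timeout) {
            throw std::runtime_error("timed_out"); // maps to timed_out_error
        }
        return f.get();
    }
};
```

The real code additionally clears the in-flight slot in a `finally` once the segment is ready; the sketch keeps it set to stay short.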
/**
@@ -1241,6 +1273,15 @@ void db::commitlog::segment_manager::discard_unused_segments() {
}
}
// FIXME: pop() will call unlink -> sleeping in reactor thread.
// Not urgent since mostly called during shutdown, but have to fix.
future<> db::commitlog::segment_manager::clear_reserve_segments() {
while (!_reserve_segments.empty()) {
_reserve_segments.pop();
}
return make_ready_future<>();
}
future<> db::commitlog::segment_manager::sync_all_segments(bool shutdown) {
logger.debug("Issuing sync for all segments");
return parallel_for_each(_segments, [this, shutdown](sseg_ptr s) {
@@ -1251,19 +1292,40 @@ future<> db::commitlog::segment_manager::sync_all_segments(bool shutdown) {
}
future<> db::commitlog::segment_manager::shutdown() {
if (!_shutdown) {
_shutdown = true; // no re-arm, no create new segments.
_timer.cancel(); // no more timer calls
// Now first wait for periodic task to finish, then sync and close all
// segments, flushing out any remaining data.
return _gate.close().then(std::bind(&segment_manager::sync_all_segments, this, true));
if (!_shutdown_promise) {
_shutdown_promise = shared_promise<>();
// Wait for all pending requests to finish. Need to sync first because segments that are
// alive may be holding semaphore permits.
auto block_new_requests = get_units(_request_controller, max_request_controller_units());
return sync_all_segments(false).then([this, block_new_requests = std::move(block_new_requests)] () mutable {
return std::move(block_new_requests).then([this] (auto permits) {
_timer.cancel(); // no more timer calls
_shutdown = true; // no re-arm, no create new segments.
// Now first wait for periodic task to finish, then sync and close all
// segments, flushing out any remaining data.
return _gate.close().then(std::bind(&segment_manager::sync_all_segments, this, true));
});
}).finally([this] {
// Now that the gate is closed and requests completed we are sure nobody else will pop()
return clear_reserve_segments().finally([this] {
return std::move(_reserve_replenisher).then_wrapped([this] (auto f) {
// Could be cleaner with proper seastar support
if (f.failed()) {
_shutdown_promise->set_exception(f.get_exception());
} else {
_shutdown_promise->set_value();
}
});
});
});
}
return make_ready_future<>();
return _shutdown_promise->get_shared_future();
}
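The shutdown path above relies on a request controller: each in-flight write holds one unit, and shutdown "blocks new requests" by acquiring *all* units, which can only succeed once every pending write has returned its unit. A minimal single-threaded sketch of that idea (a hand-rolled counter, not Seastar's `semaphore`; names are illustrative):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Each write acquires one unit and releases it when done. Shutdown acquires
// max_units() in one go, so it blocks until no write is still in flight,
// and any later acquire(1) blocks because no units remain.
class request_controller {
    std::mutex _m;
    std::condition_variable _cv;
    long _units;
    const long _max;
public:
    explicit request_controller(long max) : _units(max), _max(max) {}
    void acquire(long n) { // waits until n units are free, then takes them
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [&] { return _units >= n; });
        _units -= n;
    }
    void release(long n) {
        { std::lock_guard<std::mutex> g(_m); _units += n; }
        _cv.notify_all();
    }
    long max_units() const { return _max; }
};
```

Note the ordering in the diff: sync first, *then* grab all units, because live segments may themselves be holding permits.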
void db::commitlog::segment_manager::orphan_all() {
future<> db::commitlog::segment_manager::orphan_all() {
_segments.clear();
_reserve_segments.clear();
return clear_reserve_segments();
}
/*
@@ -1278,7 +1340,7 @@ future<> db::commitlog::segment_manager::clear() {
for (auto& s : _segments) {
s->mark_clean();
}
orphan_all();
return orphan_all();
});
}
/**
@@ -1309,37 +1371,7 @@ void db::commitlog::segment_manager::on_timer() {
flush_segments();
}
}
// take outstanding allocations into regard. This is paranoid,
// but if for some reason the file::open takes longer than timer period,
// we could flood the reserve list with new segments
//
// #482 - _reserve_allocating is decremented in the finally clause below.
// This is needed because if either allocate_segment _or_ emplacing into
// _reserve_segments should throw, we still need the counter reset
// However, because of this, it might be that emplace was done, but not decrement,
// when we get here again. So occasionally we might get a sum of the two that is
// not consistent. It should however always just potentially be _too much_, i.e.
// just an indicator that we don't need to do anything. So let's do that.
auto n = std::min(_reserve_segments.size() + _reserve_allocating, _num_reserve_segments);
return parallel_for_each(boost::irange(n, _num_reserve_segments), [this, n](auto i) {
++_reserve_allocating;
return this->allocate_segment(false).then([this](sseg_ptr s) {
if (!_shutdown) {
// insertion sort.
auto i = std::upper_bound(_reserve_segments.begin(), _reserve_segments.end(), s, [](sseg_ptr s1, sseg_ptr s2) {
const descriptor& d1 = s1->_desc;
const descriptor& d2 = s2->_desc;
return d1.id < d2.id;
});
i = _reserve_segments.emplace(i, std::move(s));
logger.trace("Added reserve segment {}", *i);
}
}).finally([this] {
--_reserve_allocating;
});
});
}).handle_exception([](std::exception_ptr ep) {
logger.warn("Exception in segment reservation: {}", ep);
return make_ready_future<>();
});
arm();
}
@@ -1410,7 +1442,7 @@ void db::commitlog::segment_manager::release_buffer(buffer_type&& b) {
* Add mutation.
*/
future<db::replay_position> db::commitlog::add(const cf_id_type& id,
size_t size, serializer_func func) {
size_t size, commitlog::timeout_clock::time_point timeout, serializer_func func) {
class serializer_func_entry_writer final : public entry_writer {
serializer_func _func;
size_t _size;
@@ -1419,17 +1451,16 @@ future<db::replay_position> db::commitlog::add(const cf_id_type& id,
: _func(std::move(func)), _size(sz)
{ }
virtual size_t size(segment&) override { return _size; }
virtual size_t size() override { return _size; }
virtual void write(segment&, output& out) override {
_func(out);
}
};
auto writer = ::make_shared<serializer_func_entry_writer>(size, std::move(func));
return _segment_manager->active_segment().then([id, writer] (auto s) {
return s->allocate(id, writer);
});
return _segment_manager->allocate_when_possible(id, writer, timeout);
}
future<db::replay_position> db::commitlog::add_entry(const cf_id_type& id, const commitlog_entry_writer& cew)
future<db::replay_position> db::commitlog::add_entry(const cf_id_type& id, const commitlog_entry_writer& cew, timeout_clock::time_point timeout)
{
class cl_entry_writer final : public entry_writer {
commitlog_entry_writer _writer;
@@ -1439,6 +1470,9 @@ future<db::replay_position> db::commitlog::add_entry(const cf_id_type& id, const
_writer.set_with_schema(!seg.is_schema_version_known(_writer.schema()));
return _writer.size();
}
virtual size_t size() override {
return _writer.mutation_size();
}
virtual void write(segment& seg, output& out) override {
if (_writer.with_schema()) {
seg.add_schema_version(_writer.schema());
@@ -1447,9 +1481,7 @@ future<db::replay_position> db::commitlog::add_entry(const cf_id_type& id, const
}
};
auto writer = ::make_shared<cl_entry_writer>(cew);
return _segment_manager->active_segment().then([id, writer] (auto s) {
return s->allocate(id, writer);
});
return _segment_manager->allocate_when_possible(id, writer, timeout);
}
db::commitlog::commitlog(config cfg)
@@ -1546,7 +1578,7 @@ const db::commitlog::config& db::commitlog::active_config() const {
// on error at startup if required
future<std::unique_ptr<subscription<temporary_buffer<char>, db::replay_position>>>
db::commitlog::read_log_file(const sstring& filename, commit_load_reader_func next, position_type off) {
return open_checked_file_dma(commit_error, filename, open_flags::ro).then([next = std::move(next), off](file f) {
return open_checked_file_dma(commit_error_handler, filename, open_flags::ro).then([next = std::move(next), off](file f) {
return std::make_unique<subscription<temporary_buffer<char>, replay_position>>(
read_log_file(std::move(f), std::move(next), off));
});
@@ -1557,6 +1589,15 @@ db::commitlog::read_log_file(const sstring& filename, commit_load_reader_func ne
subscription<temporary_buffer<char>, db::replay_position>
db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type off) {
struct work {
private:
file_input_stream_options make_file_input_stream_options() {
file_input_stream_options fo;
fo.buffer_size = db::commitlog::segment::default_size;
fo.read_ahead = 10;
fo.io_priority_class = service::get_local_commitlog_priority();
return fo;
}
public:
file f;
stream<temporary_buffer<char>, replay_position> s;
input_stream<char> fin;
@@ -1570,9 +1611,10 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
size_t corrupt_size = 0;
bool eof = false;
bool header = true;
bool failed = false;
work(file f, position_type o = 0)
: f(f), fin(make_file_input_stream(f)), start_off(o) {
: f(f), fin(make_file_input_stream(f, 0, make_file_input_stream_options())), start_off(o) {
}
work(work&&) = default;
@@ -1603,6 +1645,10 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
eof = true;
return make_ready_future<>();
}
future<> fail() {
failed = true;
return stop();
}
future<> read_header() {
return fin.read_exactly(segment::descriptor_header_size).then([this](temporary_buffer<char> buf) {
if (!advance(buf)) {
@@ -1739,7 +1785,9 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
return make_ready_future<>();
}
return s.produce(buf.share(0, data_size), rp);
return s.produce(buf.share(0, data_size), rp).handle_exception([this](auto ep) {
return this->fail();
});
});
});
}
@@ -1755,6 +1803,8 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
throw segment_data_corruption_error("Data corruption", corrupt_size);
}
});
}).finally([this] {
return fin.close();
});
}
};
@@ -1763,7 +1813,9 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
auto ret = w->s.listen(std::move(next));
w->s.started().then(std::bind(&work::read_file, w.get())).then([w] {
w->s.close();
if (!w->failed) {
w->s.close();
}
}).handle_exception([w](auto ep) {
w->s.set_exception(ep);
});
@@ -1788,12 +1840,7 @@ uint64_t db::commitlog::get_flush_count() const {
}
uint64_t db::commitlog::get_pending_tasks() const {
return _segment_manager->totals.pending_writes
+ _segment_manager->totals.pending_flushes;
}
uint64_t db::commitlog::get_pending_writes() const {
return _segment_manager->totals.pending_writes;
return _segment_manager->totals.pending_flushes;
}
uint64_t db::commitlog::get_pending_flushes() const {
@@ -1801,11 +1848,7 @@ uint64_t db::commitlog::get_pending_flushes() const {
}
uint64_t db::commitlog::get_pending_allocations() const {
return _segment_manager->totals.pending_allocations;
}
uint64_t db::commitlog::get_write_limit_exceeded_count() const {
return _segment_manager->totals.write_limit_exceeded;
return _segment_manager->pending_allocations();
}
uint64_t db::commitlog::get_flush_limit_exceeded_count() const {
@@ -1850,3 +1893,6 @@ future<std::vector<sstring>> db::commitlog::list_existing_segments(const sstring
});
}
std::vector<sstring> db::commitlog::get_segments_to_replay() {
return std::move(_segment_manager->_segments_to_replay);
}


@@ -94,6 +94,8 @@ using cf_id_type = utils::UUID;
*/
class commitlog {
public:
using timeout_clock = std::chrono::steady_clock;
class segment_manager;
class segment;
@@ -171,9 +173,23 @@ public:
/**
* Add a "Mutation" to the commit log.
*
* Resolves with timed_out_error when timeout is reached.
*
* @param mutation_func a function that writes 'size' bytes to the log, representing the mutation.
*/
future<replay_position> add(const cf_id_type& id, size_t size, serializer_func mutation_func);
future<replay_position> add(const cf_id_type& id, size_t size, timeout_clock::time_point timeout, serializer_func mutation_func);
/**
* Template version of add.
* Resolves with timed_out_error when timeout is reached.
* @param mu an invokable op that generates the serialized data. (Of size bytes)
*/
template<typename _MutationOp>
future<replay_position> add_mutation(const cf_id_type& id, size_t size, timeout_clock::time_point timeout, _MutationOp&& mu) {
return add(id, size, timeout, [mu = std::forward<_MutationOp>(mu)](output& out) {
mu(out);
});
}
/**
* Template version of add.
@@ -181,17 +197,15 @@ public:
*/
template<typename _MutationOp>
future<replay_position> add_mutation(const cf_id_type& id, size_t size, _MutationOp&& mu) {
return add(id, size, [mu = std::forward<_MutationOp>(mu)](output& out) {
mu(out);
});
return add_mutation(id, size, timeout_clock::time_point::max(), std::forward<_MutationOp>(mu));
}
/**
* Add an entry to the commit log.
*
* Resolves with timed_out_error when timeout is reached.
* @param entry_writer a writer responsible for writing the entry
*/
future<replay_position> add_entry(const cf_id_type& id, const commitlog_entry_writer& entry_writer);
future<replay_position> add_entry(const cf_id_type& id, const commitlog_entry_writer& entry_writer, timeout_clock::time_point timeout);
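Note that the new overloads take an absolute deadline (a `timeout_clock::time_point`), not a duration, so one write budget is shared across every internal wait (segment allocation, dispatch, etc.). A hedged usage sketch of that convention in plain C++ (`make_deadline`/`expired` are illustrative helpers, not part of this API):

```cpp
#include <cassert>
#include <chrono>

// Same clock as the commitlog API above.
using timeout_clock = std::chrono::steady_clock;

// Compute the deadline once at the top of a write; every retry or internal
// wait then compares against this same time_point instead of restarting a
// fresh duration.
timeout_clock::time_point make_deadline(std::chrono::milliseconds budget) {
    return timeout_clock::now() + budget;
}

bool expired(timeout_clock::time_point deadline) {
    return timeout_clock::now() >= deadline;
}
```

Passing `timeout_clock::time_point::max()` (as the duration-less `add_mutation` overload does) effectively disables the timeout.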
/**
* Modifies the per-CF dirty cursors of any commit log segments for the column family according to the position
@@ -241,14 +255,20 @@ public:
*/
std::vector<sstring> get_active_segment_names() const;
/**
* Returns a vector of segment paths which were
* preexisting when this instance of commitlog was created.
*
* The list will be empty when called for the second time.
*/
std::vector<sstring> get_segments_to_replay();
uint64_t get_total_size() const;
uint64_t get_completed_tasks() const;
uint64_t get_flush_count() const;
uint64_t get_pending_tasks() const;
uint64_t get_pending_writes() const;
uint64_t get_pending_flushes() const;
uint64_t get_pending_allocations() const;
uint64_t get_write_limit_exceeded_count() const;
uint64_t get_flush_limit_exceeded_count() const;
uint64_t get_num_segments_created() const;
uint64_t get_num_segments_destroyed() const;
@@ -321,6 +341,8 @@ private:
struct entry_writer {
virtual size_t size(segment&) = 0;
// Returns the segment-independent size of the entry. Must be <= the segment-dependent size.
virtual size_t size() = 0;
virtual void write(segment&, output&) = 0;
};
};


@@ -33,48 +33,26 @@
#include "idl/mutation.dist.impl.hh"
#include "idl/commitlog.dist.impl.hh"
commitlog_entry::commitlog_entry(stdx::optional<column_mapping> mapping, frozen_mutation&& mutation)
: _mapping(std::move(mapping))
, _mutation_storage(std::move(mutation))
, _mutation(*_mutation_storage)
{ }
commitlog_entry::commitlog_entry(stdx::optional<column_mapping> mapping, const frozen_mutation& mutation)
: _mapping(std::move(mapping))
, _mutation(mutation)
{ }
commitlog_entry::commitlog_entry(commitlog_entry&& ce)
: _mapping(std::move(ce._mapping))
, _mutation_storage(std::move(ce._mutation_storage))
, _mutation(_mutation_storage ? *_mutation_storage : ce._mutation)
{
}
commitlog_entry& commitlog_entry::operator=(commitlog_entry&& ce)
{
if (this != &ce) {
this->~commitlog_entry();
new (this) commitlog_entry(std::move(ce));
}
return *this;
}
commitlog_entry commitlog_entry_writer::get_entry() const {
if (_with_schema) {
return commitlog_entry(_schema->get_column_mapping(), _mutation);
} else {
return commitlog_entry({}, _mutation);
}
template<typename Output>
void commitlog_entry_writer::serialize(Output& out) const {
[this, wr = ser::writer_of_commitlog_entry<Output>(out)] () mutable {
if (_with_schema) {
return std::move(wr).write_mapping(_schema->get_column_mapping());
} else {
return std::move(wr).skip_mapping();
}
}().write_mutation(_mutation).end_commitlog_entry();
}
void commitlog_entry_writer::compute_size() {
_size = ser::get_sizeof(get_entry());
seastar::measuring_output_stream ms;
serialize(ms);
_size = ms.size();
}
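The rewritten `compute_size()` above illustrates two-pass serialization: the same templated `serialize()` runs first against a measuring stream that only counts bytes (no allocation, no copying), then against a real stream of exactly that size. A minimal sketch of the idea in standard C++ (these `measuring_stream`/`simple_stream` types mimic but are not Seastar's `measuring_output_stream`/`simple_output_stream`):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1: a sink that discards data and only accumulates the byte count.
struct measuring_stream {
    size_t size = 0;
    void write(const char*, size_t n) { size += n; }
};

// Pass 2: a real sink; in the commitlog this would be the reserved slice
// of the segment buffer.
struct simple_stream {
    std::vector<char> buf;
    void write(const char* p, size_t n) { buf.insert(buf.end(), p, p + n); }
};

// One serializer, templated on the output type, used for both passes --
// a toy framing (length prefix + payload) stands in for the IDL writer.
template <typename Output>
void serialize_entry(Output& out, const std::vector<char>& payload) {
    size_t len = payload.size();
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(payload.data(), payload.size());
}
```

Because both passes share one serializer, the measured size is guaranteed to match the bytes written, which is what lets `write()` reserve exactly `size()` bytes up front.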
void commitlog_entry_writer::write(data_output& out) const {
seastar::simple_output_stream str(out.reserve(size()), size());
ser::serialize(str, get_entry());
serialize(str);
}
commitlog_entry_reader::commitlog_entry_reader(const temporary_buffer<char>& buffer)


@@ -31,15 +31,10 @@ namespace stdx = std::experimental;
class commitlog_entry {
stdx::optional<column_mapping> _mapping;
stdx::optional<frozen_mutation> _mutation_storage;
const frozen_mutation& _mutation;
frozen_mutation _mutation;
public:
commitlog_entry(stdx::optional<column_mapping> mapping, frozen_mutation&& mutation);
commitlog_entry(stdx::optional<column_mapping> mapping, const frozen_mutation& mutation);
commitlog_entry(commitlog_entry&&);
commitlog_entry(const commitlog_entry&) = delete;
commitlog_entry& operator=(commitlog_entry&&);
commitlog_entry& operator=(const commitlog_entry&) = delete;
commitlog_entry(stdx::optional<column_mapping> mapping, frozen_mutation&& mutation)
: _mapping(std::move(mapping)), _mutation(std::move(mutation)) { }
const stdx::optional<column_mapping>& mapping() const { return _mapping; }
const frozen_mutation& mutation() const { return _mutation; }
};
@@ -50,8 +45,9 @@ class commitlog_entry_writer {
bool _with_schema = true;
size_t _size;
private:
template<typename Output>
void serialize(Output&) const;
void compute_size();
commitlog_entry get_entry() const;
public:
commitlog_entry_writer(schema_ptr s, const frozen_mutation& fm)
: _schema(std::move(s)), _mutation(fm)
@@ -74,6 +70,10 @@ public:
return _size;
}
size_t mutation_size() const {
return _mutation.representation().size();
}
void write(data_output& out) const;
};
@@ -84,4 +84,4 @@ public:
const stdx::optional<column_mapping>& get_column_mapping() const { return _ce.mapping(); }
const frozen_mutation& mutation() const { return _ce.mutation(); }
};
};
