Failing to close a file properly before destroying file's object causes
crashes.
[tgrabiec: fixed typo]
Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
after 'compaction: make major compaction go through compaction manager',
the test fails because task is preempted in debug mode before it reaches
intruction to increase stat.
Backport from 8dfb5f9c33
Included single-line change in compaction_manager_test from
8b0e358d73 that fixes release mode too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170714194946.12467-1-raphaelsc@scylladb.com>
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.
There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.
This patch fixes this and enforces the expected order.
Fixes#2531Fixes#2593
Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162400.15968-1-duarte@scylladb.com>
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor and major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.
Fixes#1156.
NOTE: resolved conflict because in this version manager doesn't
stop transportation services for io errors.
(cherry picked from commit 3286f7aaa6)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170614211923.9955-2-raphaelsc@scylladb.com>
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.
Fixes#192.
(cherry picked from commit 7deeffc953)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170614211923.9955-1-raphaelsc@scylladb.com>
Test should
a.) Wait for the flush semaphore
b.) Only compare segement sets between start and end, not start,
end and inbetwen. I.e. the test sort of assumed we started
with < 2 (or so) segments. Not always the case (timing)
Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 0c598e5645)
It is safe to copy column_mapping accros shards. Such guarantee comes at
the cost of performance.
This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.
Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
before after diff
write 79434.35 89247.54 +12.3%
(cherry picked from commit 374c8a56ac)
Also: Fixes#2468.
When using member name in an idetifer of generated class or method
idl compiler should strip the trailing '()'.
(cherry picked from commit 4df4994b71)
(part of #2468)
sstable::data_size() is used by rebuild_statistics() which only
returns uncompressed data size, and the function called by it
expects actual disk space used by all components.
Boot uses add_sstable() which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
r__s() too.
Fixes#1592
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
(cherry picked from commit 3b5ad23532)
partial sstable files aren't being removed after each failed attempt
to flush memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, column family may be left with a huge amount of partial
files which will overwhelm subsequent boot when removing temporary
TOC. In the past, it led to OOM because removal of temporary TOC
took place in parallel.
Fixes#2407.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
(cherry picked from commit b7e1575ad4)
We estimate number of partitions for a given range of a column familiy
and split the range into sub ranges contains fewer partitions as a
checksum unit.
The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.
Fixes#2299
Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
(cherry picked from commit 66e3b73b9c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Conflicts:
repair/repair.cc
Glauber: conflict is due to the fact that dht::token_range is still called nonwrapping_range
Message-Id: <20170522015356.4236-1-glauber@scylladb.com>
This reverts commit 54db1b5452. Glauber writes:
"I forgot one git ammend, as it seems. The token type had to be changed
in two places for it to compile, and I ended up including just one. I
will send another fixed version soon."
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.
Fixes#2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 0351ab8bc6)
The code that removes each sstable runs in a thread. Parallel
removing of a lot of sstables may start a lot of threads each of which
is taking 128k for its stack. There is no much benefit in running
deletion in parallel anyway, so fix it by deleting sstables sequentially.
Fixes#2384.
Message-Id: <20170517111722.GY3874@scylladb.com>
We estimate number of partitions for a given range of a column familiy
and split the range into sub ranges contains fewer partitions as a
checksum unit.
The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.
Fixes#2299
Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
(cherry picked from commit 66e3b73b9c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Conflicts:
repair/repair.cc
Glauber: conflict is due to the fact that dht::token_range is still called nonwrapping_range
* dist/ami/files/scylla-ami d5a4397...407e8f3 (4):
> scylla_create_devices: check block device is exists
> Rewrite disk discovery to handle EBS and NVMEs.
> add --developer-mode option
> trivial cleanup: replace tab in indent
* seastar 5810d50...4a31f27 (6):
> tls: make shutdown/close do "clean" handshake shutdown in background
> tls: Make sink/source (i.e. streams) first class channel owners
> tls.cc: Fix shutdown_input/output to conform with expected socket behaviour
> net/*: qualify net::* refs better -> ::net::* etc
> native-stack: Make sink/source (i.e. streams) first class channel owners
> posix-stack: Make sink/source (i.e. streams) first class channel owners
"This series fixes the user after free issue in gossip and elimates the
duplicated / unnecessary mark alive operations.
Fixes#2341"
* tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev:
gossip: Ignore callbacks and mark alive operation in shadow round
gossip: Ingore the duplicated mark alive operation
gossip: Fix user after free in mark_alive
(cherry picked from commit 1e04731fa0)
"Fixes #2327.
Fixes #2326."
* 'tgrabiec/fix-promoted-index-parsing-1.7' of github.com:cloudius-systems/seastar-dev:
sstables: Fix incorrect parsing of cell names in promoted index
sstables: Fix find_disk_ranges() to not miss relevant range tombstones
(cherry picked from commit ea0591ad3d)
streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.
Fixes#2287.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
(cherry picked from commit 11b74050a1)
We take a reference of endpoint_state entry in endpoint_state_map. We
access it again after code which defers, the reference can be invalid
after the defer if someone deletes the entry during the defer.
Fix this by checking take the reference again after the defering code.
I also audited the code to remove unsafe reference to endpoint_state_map entry
as much as possible.
Fixes the following SIGSEGV:
Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout
0 --default-log-level info --'.
Program terminated with signal SIGSEGV, Segmentation fault.
(this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127
127 in /usr/include/c++/5/bits/stl_pair.h
[Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))]
Fixes#2271
Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>
(cherry picked from commit d27b47595b)
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as sstable at all, we must tag it as zero-rp, and
make whole shard for it start at same zero.
This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.
Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.
I.e. three fixes:
* Reinstate min position guarding for unencoutered CF:s
* Fix stream creating in CL reader
* Fix segment map iterator use.
v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>
(cherry picked from commit b12b65db92)
Fixes#2173
Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.
Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.
Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
(cherry picked from commit c3a510a08d)
This patch exposes Scylla's Prometheus port by default. You can now use
the Scylla Monitoring project with the Docker image:
https://github.com/scylladb/scylla-grafana-monitoring
To configure the IP addresses, use the 'docker inspect' command to
determine Scylla's IP address (assuming your running container is called
'some-scylla'):
docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla
and then use that IP address in the prometheus/scylla_servers.yml
configuration file.
Fixes#1827
Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 85a127bc78)
Fixes#2098
Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.
v2:
* Fixed whitespace errors
Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 078589c508)
On Ubuntu 14.04, the lsblk doesn't have '-p' option, but
`scylla_setup` try to get block list by `lsblk -pnr` and
trigger error.
Current simple pattern will match all help content, it might
match wrong options.
scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p
-m, --perms output info about permissions
-P, --pairs use key="value" output format
Let's use strict pattern to only match option at the head. Example:
scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D'
-D, --discard print discard capabilities
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>
(cherry picked from commit 468df7dd5f)
Metrics should have their unique name. This patch changes
throttled_writes of the queu lenght to current_throttled_writes.
Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.
This could be related to scylladb/seastar#250
Fixes#2163.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
(cherry picked from commit 295a981c61)
We have:
auto halves = range.split(midpoint, dht::token_comparator());
We saw a case where midpoint == range.start, as a result, range.split
will assert becasue the range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range - hence the
assertion failure.
Fixes#2148
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
(cherry picked from commit 39d2e59e7e)
"If a node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.
To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.
Fixes #2129."
* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
db: Create default auth and tracing keyspaces using lowest timestamp
migration_manager: Append actual keyspace mutations with schema notifications
(cherry picked from commit 6db6d25f66)
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.
Fixes#2143
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
(cherry picked from commit 506e074ba4)
logalloc::reclaim_lock prevents reclaim from running which may cause
regular allocation to fail although there is enough of free memory.
To solve that there is an allocation_section which acquire reclaim_lock
and if allocation fails it run reclaimer outside of a lock and retries
the allocation. The patch make use of allocation_section instead of
direct use of reclaim_lock in memtable code.
Fixes#2138.
Message-Id: <20170306160050.GC5902@scylladb.com>
(cherry picked from commit d7bdf16a16)
Since gcc-5/stretch=5.4.1-2 removed from apt repository, we nolonger able to
build gcc-5.
To avoid dead link, use launchpad.net archives instead of using apt-get source.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ba323e2074)
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.
I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.
Fixes the following UBSAN report:
/usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'
Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 4b6e77e97e)
Fixes the following UBSAN warning:
core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment
Since the field was not initialied properly, probably also fixes some
user-visible bug.
Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 0c84f00b16)
Failing to close a file properly before destroying file's object causes
crashes.
[tgrabiec: fixed typo]
Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
ninja-build-1.6.0-2.fc23.src.rpm on fedora web site deleted for some
reason, but there is ninja-build-1.7.2-2 on EPEL, so we don't need to
backport from Fedora anymore.
Fixes#2087
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 9c8515eeed)
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
(cherry picked from commit 7a00dd6985)
Since scylla-housekeeping running as scylla user, it doesn't have a permission
to create a file on /etc/scylla.d.
So introduce /var/lib/scylla-housekeeping which owns by scylla user, place uuid
file on the directory.
Fixes#2009
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484235946-12463-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bee7f549a9)
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until
operation resolves. However, we were passing expiring value to it, the
result of unwrap().
Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test
Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.
Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1e8151b4f2)
Explicitly generate tables' IDs of tables from the system_traces KS using
generate_legacy_id() in order to ensure all Nodes create these tables with
the same IDs.
This is going to prevent hitting issue #420.
Fixes#1976
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1484153725-31030-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit ca0a0f1458)
It still has problems:
- while resharding a very large leveled compaction strategy table, a huge
amount of tiny sstables are generated, overwhelming the file descriptor
limits
- there is a large impact on read latency while resharding is going on
This patch fixes a regression introduced in 0518895, where we counted
one extra row per partition when it contained live, non static rows.
We also simplify the visitor logic further, since now we don't need to
count rows one by one. Also remove a bunch of unused fields.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482234083-2447-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit d7e607ff51)
This patch gets housekeeping to create a uuid file if a path to a uuid
file is upplied but the file is missing.
Because it import the uuid lib, uuid parameters where renamed.
Fixes#1987
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-2-git-send-email-amnon@scylladb.com>
(cherry picked from commit 32888fc0aa)
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.
Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.
Fixes#2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>
(cherry picked from commit 8f36dca6f1)
Before this patch system table writes were not writing to commit log
because database::add_column_family() disables writes to commit log
for the table which is added if _commitlog is not set at that
time. Fix by initializing commit log before system tables are created.
Fixes#1986.
Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.
Could cause system keyspace writes to be delayed for more than before
under heavy write workload. Refs #1926.
Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit cd630fece6)
During a range scan, we try to avoid sorting according to partition range
when we can do so. This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.
However, we use the wrong key for the sort -- we use the shard number. But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.
Fix by storing the iteration order as the sort key. We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.
Fixes#1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>
(cherry picked from commit eb520e7352)
1.6 truncates paged queries early to avoid overrunning server memory
with too-large query results, but in the case of partition range queries,
this terminates too early due to an uninitialized variable holding the
maximum result size. This results in slow performance due to additional
round trips.
Fix by initializing the maximum result size from the result_memory_tracker
running on the coordinating shard.
Fixes#1995.
Message-Id: <20170105103915.10633-1-avi@scylladb.com>
(cherry picked from commit 4667641f5f)
Fix the following build breakage:
FAILED: build/release/gen/cql3/CqlParser.o
g++ -MMD -MT build/release/gen/cql3/CqlParser.o -MF build/release/gen/cql3/CqlParser.o.d -std=gnu++1y -g -Wall -Werror -fvisibility=hidden -pthread -I/home/penberg/scylla/seastar -I/home/penberg/scylla/seastar/fmt -I/home/penberg/scylla/seastar/build/release/gen -march=nehalem -Ifmt -DBOOST_TEST_DYN_LINK -Wno-overloaded-virtual -DFMT_HEADER_ONLY -DHAVE_HWLOC -DHAVE_NUMA -DHAVE_LZ4_COMPRESS_DEFAULT -O2 -DBOOST_TEST_DYN_LINK -Wno-maybe-uninitialized -DHAVE_LIBSYSTEMD=1 -I. -I build/release/gen -I seastar -I seastar/build/release/gen -c -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
In file included from ./query-request.hh:31:0,
from ./locator/token_metadata.hh:51,
from ./locator/abstract_replication_strategy.hh:29,
from ./database.hh:26,
from ./service/storage_proxy.hh:44,
from ./db/schema_tables.hh:43,
from ./db/system_keyspace.hh:46,
from ./cql3/functions/function_name.hh:45,
from ./cql3/selection/selectable.hh:48,
from ./cql3/selection/writetime_or_ttl.hh:45,
from build/release/gen/cql3/CqlParser.hpp:63,
from build/release/gen/cql3/CqlParser.cpp:44:
./tracing/tracing.hh:357:5: error: ‘scollectd’ does not name a type
scollectd::registrations _registrations;
^~~~~~~~~
Message-Id: <1482939751-8756-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit a443dfa95e)
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges. Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>
(cherry picked from commit 8686a59ea5)
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.
I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.
It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."
* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
query: introduce result_memory_accounter::foreign_state
storage_proxy: fix short reads in parallel range queries
storage_proxy: pass maximum result size to replicas
mutation_partition: use result limiter for digest reads
query: make result_memory_limiter constants available for linker
result_memory_limiter: add accounter for digest reads
idl: allow writers to use any output stream
result_memory_limiter: split new_read() to new_{data, mutation}_read()
idl: is_short_read() was added in 1.6
mutation_partition: honour allowed_short_read for static rows
storage_proxy: fix _is_short_read computation
storage_proxy: disallow short reads if got no live rows
storage_proxy: don't stop after result with no live rows
(cherry picked from commit 868b4d110c)
If a node gets more MUTATION request that it can handle via RPC it will
stop reading from this RPC connection, but this will prevent it from
getting MUTATION_DONE responses for requests it coordinates because
currently MUTATION and MUTATION_DONE messages shares same connection.
To solve this problem this patches moves MUTATION_DONE messages to
separate connection.
Fixes: #1843
Message-Id: <20161201155942.GC11581@scylladb.com>
(cherry picked from commit 0a2dd39c75)
This reverts commit b81a57e8eb.
With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.
(cherry picked from commit 3989e4ed15)
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard. With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.
Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).
If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatentation. This
optimizes for the dense(r) case, where few partial results are required.
If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys. This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.
We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.
[tgrabiec: trivial conflicts]
Message-Id: <20161220170532.25173-1-avi@scylladb.com>
(cherry picked from commit a1cafed370)
"This patchset ensures the partition limit is enforced at
the storage_proxy level. Uppers layers like the pager may
already be depending on this behavior."
* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
query_pagers: Don't trim returned rows
select_statement: Don't always trim result set
query_result_merger: Limit rows
mutation_query: to_data_query_result enforces row limit
(cherry picked from commit 3421ebe8be)
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."
* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
storage_proxy: Decrease limits when retrying command
storage_proxy: Don't fetch superfluous partitions
query::result: Add partition count
column_family: Use counters in query::result::builder
query_result_builder: Use the underlying counters
mutation_partition: Count partitions in query_compacted
mutation_partition: Remove tabs in query_compacted
query::result::builder: Add partition count
query_result_merger: Limit partitions
(cherry picked from commit 6bb875bdb7)
Shared sstables will now be resharded in the same order to guarantee
that all shards owning a sstable will agree on its deletion nearly
the same time, therefore, reducing disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.
Fixes#1952.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
(cherry picked from commit 27fb8ec512)
file output streams take the responsibility of closing the file, they
will close the file as part of closing the stream.
During sstable writing we create sstable object and keep file
references there as well. Sstable object also has responsibility for
closing the files, and does so from sstable::~sstable().
Double close was supposed to be avoided by a construct like this:
writer.close().get();
_file = {};
However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.
Another problem is that if exception happened before we reached that
construct, we still should close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed and once that's fixed double close on _file will be even more
likely.
The fix employed here is to not keep files inside sstable object when
writing. As soon as the writer is constructed, it's the only owner of
the file.
Fixes#1764.
Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f2a63270d1)
RPC messaging service is initialized before the Tracing service, so
we should prevent creation of tracing spans before the service is
fully initialized.
We will use an already existing "_down" state and extend it in a way
that !_down equals "started", where "started" is TRUE when the local
service is fully initialized.
We will also split the Tracing service initialization into two parts:
1) Initialize the sharded object.
2) Start the tracing service:
- Create the I/O backend service.
- Enable tracing.
Fixes issue #1939
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 62cad0f5f5)
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.
After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
(cherry picked from commit 7133583797)
"Original, naive db::make_streaming_reader() implementation created a set
of memtable and sstable readers for every partition range. This caused
bad interaction with the code limiting sstable readers concurrency and
was suboptimal.
This series introduces multi range mutation reader that takes mutation
source and a sorted, disjoint vector of ranges. It creates only a single
set of memtable and sstable readers and fast forwards it to the next
range once the current one is completed."
* 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev:
db: use multi range reader for streaming readers
dht: describe split_range[s]_to_shards() guarantees
repair: remove outdated fixme
test/mutation_reader_test: add multi_range_reader test
tests/mutation_reader: extract key creation code
mutation_reader: add multi_range_reader
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.
The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
So far, the only way to combine outputs of multiple readers was to use
combining reader. It is very general and, in particular, supports case
when the readers emit mutations from overlapping ranges.
However, we have cases (e.g. streaming) when we need to read from
several disjoint ranges. Combining reader is a suboptimal solution as it
requires to creating a reader for each range and ignores the fact that
they do not overlap.
This patch introduces multi_range_mutation_reader which takes a
mutation_source and a sorted set of disjoint ranges. Internally, it uses
mutation_reader::fast_forward_to() to move to the next range once the
current one is completed.
We saw the message twice for the same feature check. This is a bit
confusing.
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
This is because
ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE);
ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE);
The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE)
is constructed. The second message is printed when the
ss._range_tombstones_feature is copy-constructed.
Add the helper function to enable the a feature and log the feature is
enabled.
When a feature is enabled, we see
INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled
INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled
in the log.
"This series makes Scylla limit size of query results it produces in case they
grow unreasonably large. This is possible because CQL paging queries do not
guarantee that the returned page is going to have page_size rows and pages
smaller than tha *do not* indicate end of stream. Non-paged queries and Thrift
requests do not have such flexibility and they also get all the requested data
(though their memory usage is still accounted for and may limit paged queries).
There is a maximum result size (1 MB) and all results builders will stop after
reaching it. Moreover, there is a per-shard limitation on the amount of memory
used by all results combined (10%). To avoid tiny results a query has to
reserve (wait if necessary) 4 kB before starting executing, after that it can
consume more memory without any additional waiting provided it is below
individual and shard-local limits.
Enabling the cluster to return less rows than requested also means some changes
for the coordinator. Firstly, if it receives such short result from a replica
retrying it with a larger limit obviously makes no sense whatsoever. Instead,
in such cases the coordinator removes the clustering rows it has incomplate
information about and sends short result back to the client. Moreover, even
if no replica returned short response reconciliation may have made it so. In
this case, the coordinator do not necessairly need to retry the query as well.
Unfortunately, with the current implementation short responses ruin data
queries since they will cause a digest mismatch.
Three new metrics were added:
* database_bytes_total_result_memory -- total memory used by query results
* database_total_operations_short_data_queries -- data queries that were
limited by size, particulary bad as it basically forces coordinator to
retry them as mutation queries
* database_total_operations_short_mutation_queries -- mutation queries limited
by size"
* 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev:
storage_proxy: clean up after primary_key introduction
cql3: allow short reads with paged queries
storage_proxy: handle intentional short reads
storage_proxy: make sure coordinator has complete data
storage_proxy: honour partition limit
storage_proxy: use cmd limits to determine that replica reached end
db: add metrics for short reads and memory used for results
data_query: limit result size
mutation_query: limit result size
db: create result_memory_accounters when starting query
query_builder: add partition_slice getter
reconcilable_result: keep result_memory_tracker object
mutation_compactor: honour stop_iteration from consumers
db: add result_memory_limiter
query: add result size limiter
reconcilable_result: properly propagate short_read flag
query_pagers: handle short reads properly
query: allow short reads
serializer_impl: add serializer for bool_class<Tag>
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify patch introducing its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).
However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
At the moment the coordinator does not care much for the partition
limit. In particular it doesn't check whether after reconciliation the
result still contains enough partitions.
This patch makes it honour the partition limit and increase it in the
retried queries if necessary.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Coordinator may retry a query with larger limits. However, code
determining whether replica has no more data always used the original
limits. This may cause a livelock.
For example, consider cluster having the following partitions (deletions
cover live cells):
node1:
pk=0, v=0
pk=1, v=1
node2
delete pk=0
delete pk=1
pk=2, v=2
pk=3, v=3
Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is
going to send partitions 0 and 1 while second node is going to send 2
and 3 + tombstones for 0 and 1. The coordinator will decide that it
needs to retry the request with larger row limit since node1 may have
some information about partitions 2 and 3 that are newer than what node2
has sent.
However, when the second response arrives node1 will still sent only two
rows since it has no more data. Because the coordinator uses original
row limit it will not notice that this node reached the end and we are
going to get another retry without making any progress.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This pach ensures than when we start executing a query a minimum result
size is reserved from result_memory_limiter.
Moreover, range queries need a way of merging memory usage information
from different shards.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch introduces an infrastrucutre for limiting result size.
There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an invidual limit which restricts a result to 4 MB.
In order
In order to avoid sending tiny results there is minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently, the paging implementation assumes that the server retunrs
either as many rows as it was asked for all reached the end. Soon,
that's not going to be true so instead of making any assumptions about
the number of the rows returned use the new "short read" flag to
determine whether there is going to be more data.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When paging is used the cluster is allowed to return less rows than the
client asked for. However, if such possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to sent and short reads caused by any other means.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and new segment may be
active while the previous one is still being written or not yet
synced.
Fix the test so that it expectes that the number of mutations read
this way is <= the number of mutations read, and that after all
segments are synced, the number of mutations read is equal.
Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
"The current criteria for memtable flush is not being respected. The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.
More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.
To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.
Fixes#1918"
* 'flush-criteria-v6' of github.com:glommer/scylla:
config: get rid of memtable_total_space
database: rework dirty memory hierarchy
system keyspace: write batchlog mutation in user memory
database: remove flush_token
database: abstract pressure condition notification
database: encapsulate semaphore_units into a flush_permit
database: remove friendship declaration
database: simplify flush_one
database: make memtable_list aware in cases it can't flush
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.
That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.
This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors
To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.
Fixes#1918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.
Preparation patch for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.
It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Some of our CFs can't be flushed. Those are the ones who are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.
We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.
It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, we weren't completing a query as early as possible if it
reached the partition limit, we instead had to wait until reaching the
end of the specified partition ranges. This patches fixes that by
including a check to the partition limit in the termination condition.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.
During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.
Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.
Fixes#1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
* seastar 0773e98...6fbd792 (2):
> tls: Only run our "verify" function in client session
> Merge "Clean the metric definition" from Amnon
Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
* seastar 0a74317...0773e98 (6):
> tls: Add support for client cetrificate verification & priority strings
> semaphore: add consume_units
> semaphore: add available_units()
> thread: check need_preempt for threads in a scheduling group as well
> tutorial: fix semaphore example, and text
> stop_iteration: add && and || operators
"This series:
- We can make reader with ranges
- Fix possible use after free of 'si'
- Streaming ranges now are sorted and merged
- Fix shard_begin shard_end end loop in both streaming and repair"
A range now alternates between different shards: the first part of the
range goes to shard X, the next to shard X+1, but after a while we go
back to shard X. So we can't do a simple loop between shard_begin and
shard_end.
Fix by using the newly introduced dht::split_range_to_shards
Use the cf.make_streaming_reader with ranges to simplify the code a bit.
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.
1) Less readers are needed
before: number of ranges of readers
after: smp::count readers at most
2) No foreign_ptr is needed
There is no need to forward to a shard to make the foreign_ptr for
send_info in the first phase and forward to that shard to execute the
send_info in the second phase.
3) No do_with is needed in send_mutations since si now is a
lw_shared_ptr
4) Fix possible user after free of 'si' in do_send_mutations
We need to take a reference of 'si' when sending the mutation with
send_stream_mutation rpc call, otherwise:
msg1 got exception
si->mutations_done.broken()
si is freed
msg2 got exception
si is used again
The issue is introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts) which is master only, branch 1.5 is not
affected.
Allow to make a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in following streaming patch.
We can make the reader more efficient later.
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, thus avoiding accessing random
memory and getting random results.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
This patch adds uuid file support for ubuntu system. It also split the
behaviour between restart and daily checks. The first run in r mode and
the second in d mode.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Allows scylla-housekeeping getting the uuid from a file instead of the
command line.
If the file is missing no uuid will be used.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."
* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
compaction: reduce bloom filter overhead with incremental selector
tests: add test for sstable set's incremental selector
sstable_set: introduce incremental selector
compatible_ring_position: add function to return token
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.
Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.
Fixes#1322.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to current interval. For other strategies, it just return
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.
That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.
This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
When row cache is disabled, update_cache() will do nothing to the
memtable. Active readers may keep the memtable alive for unbounded
amount of time, preventing it from going away. This doesn't play well
with virtual dirty accounting. Soon before calling update_cache(), the
memory which was subtracted during flush is added back to the amount
of virtual dirty memory. If there was write pressure all along, we
will be at the dirty memory limit. When we give back subtracted memory
this will put virtual dirty way above the limit. This will stall all
writes until another memtable flush drags virtual dirty down or
readers finally release the memtable. We want to prevent upward
jumps of virtual dirty.
First part of the fix is to ensure that as long as the memtable's
region is in the dirty group, we will not revert flushed memory. This
must happen synchronously from region's memory being removed from the
group in order to prevent upward virtual dirty jumps. To make this
easier, tracking of flushed memory was moved to the memtable object.
Another part of the fix is to gradually clear the memtable when cache
is disabled in a similar fashion as when it's moved to cache. This
ensures that the actual memory held by memtable's region is released
sooner than it dies.
Refs #1879
"Due to my misreading of Cassandra code, I thought it would ignore new
components in the Statistics component; however, it doesn't, and the change
(introduced in bdd11648ac ("sstables: add
intra-node sharding metadata") breaks sstable2json and likely any
Cassandra code that touches sstables.
To fix, move the sharding data into a new component ("Scylla.db"), which
Cassandra does ignore. The new component is designed to be extensible so
we don't experience the same issue later on."
* tag 'asias/repair/subranges/refactor_fix/v1' of github.com:cloudius-systems/seastar-dev:
repair: Limit the number of sub ranges
repair: Use estimated_keys_for_range in repair_cf_range
repair: Extract the target_partitions into repair_info class
repair: Put request_transfer_ranges into repair_info class
repair: Introduce check_failed_ranges helper
repair: Introduce do_streaming helper
repair: Make the neighbors const reference
repair: Introduce repair_info
repair: Attach the repair id in the stream plan name
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.
The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.
Introduced in abe7358767.
dtest failure:
commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup
Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
- Add "coordinator" and "replica" categories
- Use a new seastar/metrics_registration framework
* 'rearrange-storage-proxy-stats-v4' of github.com:cloudius-systems/seastar-dev:
service::storage_proxy: rework the collectd counters registration
service/storage_proxy: regroup collectd statistics
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the disciminant as the key. Parsers and writers
are auto-generated.
Currently housekeeping timer won't be reset when we restart scylla-server.
We expect the service to be run at each start, it will be consistent with
upstart script in Ubuntu 14.04
When we restart scylla-server, housekeepting timer will also be restarted,
so let's replace "OnBootSec" with "OnActiveSec".
Fixes: #1601
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <a22943cc11a3de23db266c52fd476c08014098c4.1480607401.git.amos@scylladb.com>
To reduce duplicated code and simplified scripts introduce scylla_lib.sh
for shellscripts which provides functions to classify distributions,
and load all sysconfig files.
This also fixes script bugs to misdetect Debian and RHEL.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-2-git-send-email-syuu@scylladb.com>
Since Ubuntu 15.10/16.04 still uses Upstart to manage GUI session (not as init), when we directly launch Scylla on Ubuntu's GUI Terminal(not using systemctl or initctl), raise(SIGSTOP) mistakenly calls (Because GUI session has "UPSTART_JOB" environment variable, won't happen when running Scylla as systemd service).
To avoid this, we need to verify UPSTART_JOB == "scylla-server".
If it's part of GUI session UPSTART_JOB has to be "unity7", we need to avoid raise(SIGSTOP) in that case.
Fixes#1199
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480620421-28967-1-git-send-email-syuu@scylladb.com>
So that memory is released gradually (impacting latency less) and
sooner than when memtable is destroyed. Active readers may keep the
memtable alive for unbounded amount of time.
Refs #1879
When memtable is flushing, it subtracts _flushed_memory from groups's
size to gradually allow more writes. Ideally _flushed_memory would be
equal to region's size when flush ends, so the group's size would
reach zero. When the memtable and its region are gone the group size
should remain the same as after the flush. This is ensured by adding
back _flushed_memory to group's size right before the region is
removed from the group.
Calling clear() before region is removed from the group breaks the
accounting because it will shrink the region, but will not affect the
amount of memory subtracted due to _flushed_memory. So group's size
would decrease more than we want (twice the region's size). The fix is
to change clear() so that it reverts _flushed_memory by the amount by
which the region size is reduced. This will keep the groups's size
constant as long as _flushed_memory > 0.
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:
boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));
This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
If exception is triggered early in boot when doing an I/O operation,
scylla will fail because io checker calls storage service to stop
transport services, and not all of them were initialized yet.
Scylla was failing as follow:
scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local()
[with Service = gms::gossiper]: Assertion `local_is_initialized()' failed.
Aborting on shard 0.
Backtrace:
0x000000000048a2ca
0x000000000048a3d3
0x00007fc279e739ff
0x00007fc279ad6a27
0x00007fc279ad8629
0x00007fc279acf226
0x00007fc279acf2d1
0x0000000000c145f8
0x000000000110d1bc
0x000000000041bacd
0x00000000005520f1
0x00007fc279aeaf1f
Aborted (core dumped)
Refs #883.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>
A range is diveded into N sub ranges so that each sub range contains 100
partitions. So N depends on the number of partitions in that range. N
can grow unbounded and the memory usage of vector to hold these sub
ranges can go unbouded.
Limit the max number of sub ranges a range can divided into.
The downside is that the limited sub range will make we include more
partitions in the checksum.
Fixes#1917
Use the new seastar's metrics_registration framework:
- Change the registration syntax.
- Add a long description for each counter.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Instead of putting all statistics under the same "storage_proxy" category
separate them into 2 groups according to where the corresponding counters
are updated:
- "storage_proxy_replica"
- "storage_proxy_coordinator"
Fixes#1763
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.
The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.
That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished. The
following sequence of events with its deferring points will ilustrate
it:
on_timer:
return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {
At this point, the segment id is already allocated.
new_segment():
if (_reserve_segments.empty()) {
[ ... ]
return allocate_segment(true).then ...
At this point, we have a new segment that has an id that is higher than
the previous id allocated.
Then we resume the execution from the deferring point in on_timer():
i = _reserve_segments.emplace(i, std::move(s));
The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.
Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.
This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).
Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>
The information (last compacted keys) is lost after node is restarted
or schema is updated, which causes strategy to be rebuilt.
We need it for strategy to guarantee uniform distribution of token
range across sstables, or we could end up with 1 sstable of level L
overlapping with lots of sstables of level L+1, and that results in
a compaction of undesired length.
That information can be generated from scratch by getting last key
of newest sstable in each level > 0.
Fixes#1906.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>
Streaming memtable have a delayed mode where many flushes are coalesced
together into one, with the actual flush happening later and propagated
to all the previous waiters.
However, the timer that triggers the actual flush was not using the
newly introduced flush infrastructure. This was a minor problem because
those flushes wouldn't try to take the semaphore, and so we could have
many flushes going on at the same time.
What was a potential performance issue became a correctness issue when
we moved the reversal of the dirty memory accounting out of
revert_potentially_cleaned_up_memory() into remove_from_flush_manager().
Since the latter is only called through the flush infrastructure, it
simply wasn't called. So the deferral of the reversal exposed this bug.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>
This fix splits build_ami.sh --repo to three different options:
--repo-for-install is for Scylla package installation, only valid
during AMI construction.
--repo-for-update will be stored at /etc/yum.repos.d/scylla.repo, to
receive update package on AMI.
--repo is both, for installation and update.
Fixes#1872
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480438858-6007-1-git-send-email-syuu@scylladb.com>
"The goal of this series is to prevent unbounded memory use
in cases when requests are timing out. Write requests which timed
out may still occupy memory for a while because of local mutation
application. This memory is not accounted for and can build up.
First part of the fix changes local mutation application so that it times out
at about the same time as the request handler. Then the life
time of the request handler is extended to cover any background activity
of that request which hasn't timed out yet. This has two main effects:
(1) by timing out local writes we prevent build up of background activity
for timed out requests
(2) we ensure that memory used by background activity is not left
behind unaccounted for. This will prevent CQL server from admitting
more requests than memory usage limit allows.
Fixes #1756."
* tag 'tgrabiec/prevent-oom-on-timeouts-v5' of github.com:cloudius-systems/seastar-dev:
storage_proxy: Do not flood logs with timeout errors
database: Add counter for timed out writes
storage_proxy: Delay timeout response until background work ceases
storage_proxy: Propagate timeout to local writes
storage_proxy: Use shared ownership for abstract_write_response_handler
storage_proxy: Add counter for all alive write handlers
db: Allow writes to be timed out
db: Introduce counters for failed reads and writes
commitlog: Allow allocations to be timed out
utils/logalloc: Add ability to timeout run_when_memory_available() task
utils/flush_queue: Add ability to wait with a timeout
Timeout errors are flooding the log after local mutate can time
out. We don't log remote mutate timeouts, so for consistency we won't
log local ones as well.
There is a database counter for timed out writes which can be
consulted in order to check if they're occuring.
Perhaps this would be better solved by a generic log message
throttling/coalescing mechanism, but that's not ready yet.
Write requests which timed out may still occupy memory for a while due
to local write. It should time out soon as well but there is a time
window in which it has not yet. If we don't delay timeout response,
the request would be seen as not consuming any memory too early. This
in turn would cause the CQL server to allow more requests than we
want. In some cases causing OOM or exceeding memory limits and causing
excessive cache eviciton.
Fixes#1756.
Currently the counter uses _response_handlers.size(), but after later
patches we may have an active (timed out) write with no response
handler, so count live instances instead.
Problem will cause size tiered to return small jobs when there are
more than max_threshold sstables of similar size. For example, if
max_threshold is 32, and there are 36 sstables of similar size,
strategy will only return 4 sstables to be compacted. That's because
we incorrectly create a new bucket when it meets the max threshold.
What we should do is to allow buckets to grow beyond max threshold
and trim them when selecting the most suitable one for compaction.
Important to mention that estimation for size tiered will now
work better when there are more than max_threshold sstables of
similar size.
Fixes#1901.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>
This reverts commit 0e9b75d406.
commitlog_test fails with this:
Running 14 test cases...
ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed
ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN 2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN 2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required)
*** 1 failure detected in test suite "tests/commitlog_test.cc"
WARN 2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.
Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.
Fixes#1368.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.
The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.
That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished. The
following sequence of events with its deferring points will ilustrate
it:
on_timer:
return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {
At this point, the segment id is already allocated.
new_segment():
if (_reserve_segments.empty()) {
[ ... ]
return allocate_segment(true).then ...
At this point, we have a new segment that has an id that is higher than
the previous id allocated.
Then we resume the execution from the deferring point in on_timer():
i = _reserve_segments.emplace(i, std::move(s));
The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.
Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.
This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).
Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes#1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.
During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.
This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
GCC 5.3.1 was unable to convert bound to optional<bound>.
sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));
In file included from ./dht/i_partitioner.hh:52:0,
from ./query-request.hh:28,
from ./clustering_key_filter.hh:27,
from sstables/sstables.hh:35,
from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
explicit nonwrapping_range(const wrapping_range<T>& r)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.
This is specially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the deprecated warning.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
In Thrift, SliceRange defines a count that limits the number of cells
to return from that row (in CQL3 terms, it limits the number of rows
in that partition). While this limit is honored in the engine, the
Thrift layer also applies the same limit, which, while redundant in
most cases, is used to support the get_paged_slice verb.
Currently, the limit is not being reset per Thrift row (CQL3
partition), so in practice, instead of limiting the cells in a row,
we're limiting the rows we return as well. This patch fixes that by
ensuring the limit applies only within a row/partition.
Fixes#1882
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161123220001.15496-1-duarte@scylladb.com>
With the default value of 12, a node's range is partitioned into
4096 * smp::count sub-ranges which are queried sequentually for a range
scan. If the number of rows in the table is smaller than the required
result size, we will query all of them. This can take so long that we
time out.
A better fix is to query multiple sub-ranges in parallel and merge them,
but for that we need to resurrect the non-sequential merger.
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively. Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.
This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes. Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.
In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.
Fixes #1277."
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.
This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's hard, so we use token bits which still have
some entropy. In effect, with changes the token range layout from
shard 0
shard 1
...
shard S-1
to
shard 0
shard 1
...
shard S-1
shard 0
shard 1
...
shard S-1
...
shard 0
shard 1
...
shard S-1
Where the number of repetitions of the block is 2^(ignored msb bits).
For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
Instead of calculating the owning shard from the sstable's partition
key range, delegate to the new sstable method for getting owning shard
infomation. This insulates us from changes in the sharding algorithm.
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do. Add a method that allows us
to read just the sharding data, without opening anything else.
Add a metadata component that describes token ranges that are spanned by
this sstable. With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describing sharding information, but for multi-range algorithms, this
is not sufficient.
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we star the flush and not when we finish it.
That's not the intended behavior and that can cause flushes to
accumulate.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
Introduce a new function that reuses the file_writer code to compute
the serialized size of an sstable object, by serializing it into memory
and discarding the result.
write() doesn't need to change its input; so change it to const.
The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.
This helps in writing a generic serialized_size() that is const correct,
in the next patch.
Recently we have changed our shutdown strategy to wait for the
_request_controller semaphore to make sure no other allocations are
in-flight. That was done to fix an actual issue.
The problem is that this wasn't done early enough. We acquire the
semaphore after we have already marked ourselves as _shutdown and
released the timer.
That means that if there is an allocation in flight that needs to use a
new segment, it will never finish - and we'll therefore neve acquire
the semaphore.
Fix it by acquiring it first. At this point the allocations will all be
done and gone, and then we can shutdown everything else.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>
storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be). However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.
Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.
Fixes#1863.
Message-Id: <20161120165741.2488-1-avi@scylladb.com>
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.
Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.
That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.
Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.
At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.
The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.
Fixes#1870
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
Now that access to the size_estimates system is virtualized, we no
longer need the recorder.
Fixes#1616
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch add a virtual mutation_reader so that queries
to the size_estimates system table are handled by the engine
without needing to perform any IO.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the get_local_tokens() function in storage_service
to return a future instead of requiring running under a seastar::thread.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch add the slice() function to nonwrapping range, which uses
its bounds to slice an input sequence.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts a pair of functions from mutation_partition to
calculate the lower and upper bounds of a sequence from a
nonwrapping_range.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a function to system_keyspace responsible for creating
a mutation to a partition of the size_estimates system table from a
set of range_estimates.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the range_estimates struct so that the tokens are
represented as utf8 encoded bytes. This will make future patches
require less conversions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
maximum_token for the maximum_token.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series is just a cleanup which intention is to deal with all
confusion related to the way T::memory_usage() functions work.
* T::memory_usage() which returned external memory usage are renamed
to T::external_memory_usage()
* T::memory_usage() is introduced where needed to avoid repeating
sizeof(T) + T::external_memory_usage()"
Paweł Dziepak (6):
rename memory_usage() to external_memory_usage() where applicable
streamed_mutation: add memory_usage() to mutation fragment types
keys: add memory_usage()
partition_snapshot_accounter: use range_tombstone::memory_usage()
mutation_rebuilder: use memory_usage()
frozen_mutation: use memory_usage()
In the last iterations of this patchset, we have moved explicit flushes
to acquire the semaphore directly and the coalescing inside the
memtable_list. As a result, we are no longer keeping any kind of action
for them inside the condition variable. Checking for them has no longer
a purpose.
This is a cleanup patch that remove does checks.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.
We will remove the permits ourselves it there is an exception, but not
under normal circumnstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this test will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
This patch introduces memory_usage() to static_row, clustering_row and
range_tombstone so that we can avoid repeating sizeof(T) +
x.external_memory_usage().
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"Fixes #1856
Commitlog replay reads are being issued without a priority. That means
they will lose to compaction every time."
* 'issue-1856-v2' of github.com:glommer/scylla:
commitlog: use read ahead for replay requests
commitlog: use commitlog priority for replay
commitlog: close replay file
We supported install CentOS7 .rpm on RHEL7, but we haven't supported
building on RHEL7, since there is little difference between CentOS,
and that causes build error.
This patch fixes the error, now we can produce .rpm for RHEL7 wihout
using CentOS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479431134-8032-1-git-send-email-syuu@scylladb.com>
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
the explicit flushes no longer wait on the condition variable. So we
don't.
- We now wait on the stop() flushes (regardless of their return status)
so we can make sure that the _flush_queue will indeed be done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
handler
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
Aside from putting the requests in the commitlog class, read ahead
will help us going through the file faster.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now replay is being issued with the standard seastar priority.
The rationale for that at the time is that it is an early event that
doesn't really share the disk with anybody.
That is largely untrue now that we start compactions on boot.
Compactions may fight for bandwidth with the commitlog, and with such
low priority the commitlog is guaranteed to lose.
Fixes#1856
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Replay file is opened, so it should be closed. We're not seeing any
problems arising from this, but they may happen. Enabling read ahead in
this stream makes them happen immediately. Fix it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.
The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.
We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.
Fixes#1840.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
"This patchset allows Scylla to determine the size of a memtable instead
of relying in the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.
Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
last submission, except this time I have run some experiments to gauge
behavior of that versus 1/2 of dirty memory, which was a preferred
theoretical value.
After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allow us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.
I have run mainly two workloads with this. The first one a local RF=1
workload with large partitions, sized 128kB and 100 threads. The results
are:
Before:
op rate : 632 [WRITE:632]
partition rate : 632 [WRITE:632]
row rate : 632 [WRITE:632]
latency mean : 157.8 [WRITE:157.8]
latency median : 115.5 [WRITE:115.5]
latency 95th percentile : 486.7 [WRITE:486.7]
latency 99th percentile : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max : 722.6 [WRITE:722.6]
Total partitions : 189667 [WRITE:189667]
Total errors : 0 [WRITE:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:05:00
END
After:
op rate : 951 [WRITE:951]
partition rate : 951 [WRITE:951]
row rate : 951 [WRITE:951]
latency mean : 104.8 [WRITE:104.8]
latency median : 102.5 [WRITE:102.5]
latency 95th percentile : 155.8 [WRITE:155.8]
latency 99th percentile : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max : 1081.4 [WRITE:1081.4]
Total partitions : 285324 [WRITE:285324]
Total errors : 0 [WRITE:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:05:00
END
The other workload was the workload described in #1817. And the result
is that we now have a load that is very stable around 100k ops/s and
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s and lots of timeouts, or the deep reduction of
1.5-rc1."
* 'issue-1817-v4' of github.com:glommer/scylla:
database: rework memtable flush logic
get rid of max_memtable_size
pass a region to dirty_memory_manager accounting API
memtable: add a method to expose the region_group
logalloc: allow region group reclaimer to specify a soft limit
database: remove outdated comment
database: uphold virtual dirty for system tables.
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all. The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).
This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.
In the future we may add other values for more hardware-friendly digest
algorithms.
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
Currenlty we make the mutation readers for streaming at different
time point, i.e.,
do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) {
make a mutation reader for this range
read mutations from the reader and send
})
If there are write workload in the background, we will stream extra
data, since the later the reader is made the more data we need to send.
Fix it by making all the readers before starting to stream.
Fixes#1815
Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>
If sstable Summary is not present Scylla does not refuses to boot but
instead creates summary information on the fly. There is a bug in this
code though. Summary files is a map between keys and offsets into Index
file, but the code creates map between keys and Data file offsets
instead. Fix it by keeping offset of an index entry in index_entry
structure and use it during Summary file creation.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.
This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.
After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.
For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.
Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently, is that under heavy load the commitlog will keep
sealing memtables adding to the existing load.
Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges. That is the perfect moment for us to act.
It indicates that all the data flushing part is done.
The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.
It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.
The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.
The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing down the memtable when we hit
the limit, and then throttle. When set the threshold to 1/4 of dirty, we
don't throttle at all.
This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We would like to know from which region is a particular flush coming
from, and account accordingly. The reasoning behind that, is that soon
we'll be driving the flushes internally from the dirty_memory_manager
without explcitly triggering them.
We need to start a flush before the current one finishes, otherwise
we'll have a period without significant disk activity when the current
SSTable is being sealed, the caches are being updated, etc.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
That is technically not needed because a memtable inherits from group. So
whenever we have a memtable, we can use it's group() method to obtain a
group for it, and then from there go to the region_group.
However, region() is a const method in the memtable, so we have to play
trick with the const_cast, or remove the constness from the region. An
alternative to that, which I prefer, is to expose a method for the
region_group directly from the memtable object that does the right thing
and bypasses all that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.
However, For some applications, we would be interested in knowing about
memory build up earlier, so we can start doing something about it before
we reach that condition.
This patch introduce soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition is
abated.
There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard, threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently the virtual dirty mechanism is not properly set for system
tables. We haven't divided the system table allowance by two, which
means it won't start thottling earlier as it was supposed to.
In practice, this has little effect because system table requests are
very well behaved, their sizes well known, and they tend to be
force-flushed. But we should be consistent.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"Cache entries for wide partitions are usually smaller than other
entries and the cost of recreating them is higher so it makes sense
to keep them longer than ordinary entries."
This will allow us to see how big is an amount
of evictions of cached info about wide partitions.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Query pager needs to handle results that contain partitions with
possibly multiple clustering rows quite differently than results with
just one row per partition (for example a page may end in a middle of
partition). However, the logic dealing with partitions with clustering
rows doesn't work correctly for SELECT DISTINCT queries, which are
much more similar to the ones for schemas without clustering key.
The solution is to set _has_clustering_keys to false in case of SELECT
DISTINCT queries regardless of the schema which will make pager
correctly expect each partition to return at most one rows.
Fixes#1822.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nano just so we may need
to convert it back if the histogram needs another unit.
This patch will keep everything as a duration until last moment, and
then we'll convert when needed.
This was suggested by Amnon.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
After the conversion to nonwrapping ranges, construct_range_to_endpoint_map()
may be called with semi-infinite token ranges, but it does not expect this,
calling nonwrapping_range::end()->value() unconditionally.
Fix by checking whether this is a semi-infinite range on the right, and
replace ->value() by maximum_token() instead.
Fixes `nodetool describering` (once more).
Message-Id: <1478983010-29630-1-git-send-email-avi@scylladb.com>
Exception handling was broken because after io checker, storage_io_error
exception is wrapped around system error exceptions. Also the message
when handling exception wasn't precise enough for all cases. For example,
lack of permission to write to existing data directory.
Fixes#883.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>
"JMX metrics were found to be either not showing, or showing absurd
values. Turns out there were multiple things wrong with them. The
patches were sent separately but conflict with one another. This series
is a collection of the patches needed to fix the issues we saw.
Fixes#1832, #1836, #1837"
We are tracking latencies in microseconds, but almost everywhere else
they are reported in microseconds. Instead of just converting, this
patch tries to be a bit more future proof and embed the unit into the
type - and we then default to microseconds.
I have verified that the JMX measures now report sane values for both
the storage proxy and the column family. nodetool cfhistograms still
works fine. That one is reported in nanoseconds, but through the
estimated_histogram, not ihistogram.
Fixes#1836
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have recently fixed a bug due to which the constructor parameters for
moving average were inverted, leading to the numbers being just plain
wrong. However, the calculation of alpha was already inverted, meaning
it was right by accident and now that's wrong.
With the wrong alpha, the values we see are still correct, but they move
very quickly. The intention of this code is obviously to smooth things
out.
This was found out by Nadav. I have tested and confirmed that the smoothing
factor now works as expected.
Fixes #1837
Signed-off-by: Glauber Costa <glauber@scylladb.com>
moving_averages constructor is defined like this:
moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)
But when it is time to initialize them, we do this:
... {tick_interval(), std::chrono::minutes(1)} ...
As it can be seen, the interval and tick interval are inverted. This
leads to the metrics being assigned bogus values.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
Snapshot destructor may free some objects managed by the LSA. That's why
partition_snapshot_reader destructor explicitly destroys the snapshot it
uses. However, it was possible that exception thrown by _read_section
prevented that from happenning making snapshot destoryed implicitly
without current allocator set to LSA.
Refs #1831.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>
After 7c28ed, the schemas defined in the test became compressed by
default. This patch changes the test so that it is explicit about
which schemas shouldn't define a compressor.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>
"Mostly small changes/additions to the API calls to match Cv3
requirements/semantics, i.e. updated scylla-jmx can implement required
nodetool etc calls in a working fashion."
The CQL 3.1 documentation specifies that for disabling compression,
users should use an empty string:
ALTER TABLE mytable WITH COMPRESSION = {'sstable_compression': ''};
However, Cassandra also accepts the absence of the sstable_compression
option to disable compression. The patch 7c28ed prevented this behavior
in Scylla, which this patch aims to fix.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478639499-4183-1-git-send-email-duarte@scylladb.com>
Current speculation target selection logic has several bugs in multi-dc
setup. It may select a non local target for CL=LOCAL and it may select
more than one target to speculate, one of which is non local.
Examples:
1. Two dataceneters: DC1 RF 2, DC2 RF 2 and read with LOCAL_QUORUM.
In this scenario db::filter_for_query() will return both replicas from
local DC and speculation target selection logic will peek one one which
will be in different DC.
2. Two dataceneters: DC1 RF 2, DC2 RF 2 and read with LOCAL_ONE + RRD.DC_LOCAL
In this scenario db::filter_for_query() will return all nodes in local DC and
there already be enough nodes to speculate, but current logic will add
one node from non local dc as a speculation target.
The patch below fixed both of those scenarios.
Message-Id: <20161103154637.GS7766@scylladb.com>
Since continuity flag introduction row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
partition_range passed to row_cache::make_reader
has to be kept alive as long as the resulting reader
is used.
Otherwise weird things start to happen.
This used to work just because of a pure luck.
When I started changing the row_cache implementation
I run into very weird behaviors for this tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>
Commit 8fca1887c2 ("storage_service: fix range wrapping in
describe_ring") fixed incorrect range wrapping code for describe_ring,
but fails when the number of endpoints for a token is greater than one,
because the endpoints are stored in an unordered vector.
Fix by comparing the endpoints in a way that ignores their order.
Message-Id: <1478460826-15923-1-git-send-email-avi@scylladb.com>
Fixes#1775
stream lacks a check "is_open", which is a bummer. We have to both
prevent exception propagation and add a flag of our own to make sure
exceptions in producer code reaches consumer, and does not simply
get lost in the reactor.
Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>
This patchset adds missing properties to the create_view_statement,
such as whether the view is compact or the order of its clustering
columns.
Fixes#1766
"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion. This
causes high cpu utilization during startup.
Change the code to avoid quadratic complexity, and add some unit tests.
See #1812."
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy. This guarantees
that any exception does not change the data, since the assignment does
not require allocation.
However, it does result in quadratic behavior. During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
are do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.
Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.
Fixes#1812 (the quadratic beahvior part).
"Currently, partition range queries are processed in parallel on all
shards. This is inefficient because we are likely to drop the results
from all but one shard, assuming a well-populated column family. We
are multiplying our work by a factor of smp::count.
While this is worthwhile in its own right, it is really an excuse to
sneak in the range/shard generator (patch 5), which is preliminary for
a new sharding algorithm, dividing tokens among shards based on the
middle-significant bits rather than the most-siginificant bits (which
alias with vnodes)
Fixes #1573."
Instead of asking a shard for cmd->partition_limit and cmd->row_limit,
just ask it for the number of partitions and rows still needed to
satisfy the query. This removes the need to trim the shard's result.
Since every shard might cause the row_limit quota to be satisfied, every
shard might be the last one we need. Hence it is better to process shards
sequentially, stopping if the quota is reached or the range is exhausted.
The original code tried to yield to reduce latency, but this is now
unnecessary, as we're doing a lot less work per iteration (if it becomes
necessary, we should do it on the replica shard, not the coordinating shard).
Building on the single-range sharder, add a sharder for vectors of
partition ranges. This helps with wrapped ranges, which are translated
into a vector containing two shards.
Divides a ring_position range into a sequence of shard/range pairs. This
allows sequential iteration over shards in ring order.
The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page. With the generator, we can switch to sequential
execution.
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.
To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.
It is a sort-of inverse to shard_of(), in that
shard_of(token_for_next_range(t)) == shard_of(t) + 1
- Add a inserts, updates, deletes members to cql_stats.
- Store cql_stats& in a modification_statement and increment the corresponding counter according to the value of a "type" field.
- Store cql_stats& in a batch_statement and increment the statistics for each BATCH member.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Add a "reads" counter to a cql3::cql_stats struct.
- Store a reference for a query_processor::_cql_stats in the select_statement object.
- Increment a "reads" counter where needed.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Right now we are calculating latencies only when we are about to add an
item to the memtable.
That's incorrect and misleading, for two reasons. First, it leaves the
commitlog latencies out. But second, it is done after the memtable wall
effect is applied, which means we are not counting throttle time neither
in the memtables or in the commitlog.
To do that, we'll start the latency_counter object as soon as possible
and move it all the way to apply_in_memory(). That should span the
entire write operation.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>
There are two variants of apply_in_memory() being called in do_apply():
with and without the commitlog. The main differences are that when the
commitlog is involved, we need to wait for its future to complete before
moving to apply_in_memory. That can easily be factored out by providing
an always-ready future if we don't have the commitlog enabled, and
waiting on that.
The second, is that the commitlog version can cause apply_in_memory to
generate an exception if there is replay position reordering. However,
there is no harm in appending the exception handler to both versions. In
one of them it's an impossible exception, but that's fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.
If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
Since size estimates are stored as wrapped ranges, we call compat::wrap()
to convert from the now-standard unwrapped ranges back to wrapped ranges.
However, compat::wrap() relies on the ranges being in sorted order,
but our input is not. This leads to a crash as we find an unexpected
empty token in the middle of the vector.
Sort it so compat::wrap() works as expected.
Fixes#1804.
Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
We were ignored --smp option taken from io.conf since iotune didn't supported
it, but now it supported we can pass it.
(We need to pass it because we need to measure io performance on same condition
with scylla)
Fixes#1768
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478082591-27205-1-git-send-email-syuu@scylladb.com>
Under the hood, the selectable::add_and_get_index() function
deliberately filters out duplicate columns. This causes
simple_selector::get_output_row() to return a row with all duplicate
columns filtered out, which triggers and assertion because of row
mismatch with metadata (which contains the duplicate columns).
The fix is rather simple: just make selection::from_selectors() use
selection_with_processing if the number of selectors and column
definitions doesn't match -- like Apache Cassandra does.
Fixes#1367
Message-Id: <1477989740-6485-1-git-send-email-penberg@scylladb.com>
This patch adds an `update-version` script for updating the Scylla
version number in `SCYLLA-VERSION-GEN` file and committing the change to
git.
Example use:
$ ./scripts/update-version 1.4.0
which results into the following git commit:
commit 4599c16d9292d8d9299b40a3e44ef7ee80e3c3cf
Author: Pekka Enberg <penberg@scylladb.com>
Date: Fri Oct 28 10:24:52 2016 +0300
release: prepare for 1.4.0
diff --git a/SCYLLA-VERSION-GEN b/SCYLLA-VERSION-GEN
index 753c982..eba2da4 100755
--- a/SCYLLA-VERSION-GEN
+++ b/SCYLLA-VERSION-GEN
@@ -1,6 +1,6 @@
#!/bin/sh
-VERSION=666.development
+VERSION=1.4.0
if test -f version
then
Message-Id: <1477639560-10896-1-git-send-email-penberg@scylladb.com>
Fixes#1709.
* 'refresh-resilient-v3' of github.com:raphaelsc/scylla:
db: make refresh resilient to permission denied error
db: make it possible to use custom error handler with io checker
sstables: remove duplicated declaration of remove_by_toc_name
We use `data_resource` class in the CQL parser, which let's users refer
to a table resource without specifying a keyspace. This asserts out in
get_level() for no good reason as we already know the intented level
based on the constructor. Therefore, change `data_resource` to track the
level like upstream Cassandra does and use that.
Fixes#1790
Message-Id: <1477599169-2945-1-git-send-email-penberg@scylladb.com>
We store all auth perm strings in upper case, but the user might very
well pass this in upper case.
We could use a standard key comparator / hash here, but since the
strings tend to be small, the new sstring will likely be allocated in
the stack here and this approach yields significantly less code.
Fixes#1791.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <51df92451e6e0a6325a005c19c95eaa55270da61.1477594199.git.glauber@scylladb.com>
User may forget to set permission of new sstables in upload dir
before refreshing them, and that will result in shutdown.
io_checker is now able to work with a custom handler, so all we
have to do is to whitelist EACCES.
Fixes#1709.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By default, io checker will cause Scylla to shutdown if it finds
specific system errors. Right now, io checker isn't flexible
enough to allow a specialized handler. For example, we don't want
to Scylla to shutdown if there's an permission problem when
uploading new files from upload dir. This desired flexibility is
made possible here by allowing a handler parameter to io check
functions and also changing existing code to take advantage of it.
That's a step towards fixing #1709.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
EC2 paravirtual instances uses pv-grub, which refers /boot/grub/menu.lst (grub0.9x config file) instead of grub2 config file.
So add boot parameters on /boot/grub/menu.lst when the file exists, and the instance is on EC2.
Fixes#1598
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472056875-17512-1-git-send-email-syuu@scylladb.com>
wget is often used from scripts recording to logs; as it emits a log
line every second, the logs are huge and unreadable. Make it quieter.
Message-Id: <1477558534-32718-1-git-send-email-avi@scylladb.com>
"5ff699e09fcbd62611e78b9de601f6c8636ab2f0 ("row_cache: rework cache to
use fast forwarding reader") brought some significant changes to the
row cache implementation. Unfortunately, "significant changes" often
translates to "more bugs" and this time was no different.
This series contains fixes for the problems introduced in that rework
and makes failing dtest
bootstrap_test.py:TestBootstrap.local_quorum_bootstrap_test
pass again."
* 'pdziepak/cache-fixes/v1' of github.com:cloudius-systems/seastar-dev:
row_cache: avoid dereferencing invalid iterator
row_cache: set _first_element flag correctly
row_cache: fix clearing continuity flag at eviction
Conditions in row_cache::do_find_or_create_entry() make it possible that
std::prev(it) is going to be dereferenced even if it is a begin
iterator.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If the continuity flag was set for the first element _first_element flag
would not be cleared. This shouldn't cause any correctness problems but
properly setting the flag allows to avoid some unnecessary key
comparisons.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In original implementation the continuity flag indicated that cache has
full information about the range the between current partition and the
one following it, hence when evicting an entry the one preceeding it
had to have its continuity flag cleared.
This was changed, however, and now the continuiy flag tells whether the
cache is continuous between the current element and the one before it.
This means that eviction code needs to clear the flag for the entry
directly following the evicted one.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
* seastar 69acec1...98b5a2d (9):
> rpc: Silence warning about ignored failed future
> future: prioritise continuations that can run immediately
> iotune: relax aio restrictions
> build: support for static linking with boost
> rpc: Fix crash during connection teardown
> rpc: Move _connected flag to protocol::connection
> rpc test: fail test if exception is thrown during test execution
> rpc: do not assume underling semaphore type
> rpc: fix default resource limit
In boost 1.60, the executable's command-line arguments are expected to
be separated from the boost command-line arguments by '--'. Detect
this requirement and comply with it.
Message-Id: <1477212424-3831-1-git-send-email-avi@scylladb.com>
We calculate two sizes during the allocation: "size", which is the
in-segment size of this mutation, and "s", which is that plus the
overhead. cycle() must be called with the latter, not the former, as
doing otherwise may lead to buffer overflows.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>
"This patchset reworks the commitlog logic to better handle conditions in
which we are getting requests faster than the disk can handle. It does
this by building a wall around the commitlog and only allowing
allocations to proceed when we are under the desired memory threshold.
The main advantage of that is that we can now easily set the commitlog
to work at disk speed, more or less allowing an "one byte in for each
byte out" approach instead of depending on the current cycle to finish.
As a result, max latencies are greatly reduced.
Testing Results
===============
To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.
After 10 minutes running this load, we are left with the following
percentiles:
latency mean : 51.9 [WRITE:51.9]
latency median : 9.8 [WRITE:9.8]
latency 95th percentile : 125.6 [WRITE:125.6]
latency 99th percentile : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max : 2338.2 [WRITE:2338.2]
After this patch:
latency mean : 54.9 [WRITE:54.9]
latency median : 43.5 [WRITE:43.5]
latency 95th percentile : 126.9 [WRITE:126.9]
latency 99th percentile : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max : 471.4 [WRITE:471.4]
I have run this with larger sizes as well, and it generally performs
much better than the baseline version. For sizes up to 5MB, I have seen
no timeouts in my setup. After that, I see some timeouts. Buffer
splitting is expected to make this better.
Aside from performance testing, this was also tested with batch and
periodic mode for various requests sizes."
Current tracker for pending allocations is a queue_size GAUGE. Add a
total_operations version so we have more insight on what's going on.
It will be called requests_blocked_memory for consistency with other
subsystems that track similar things.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
scylla-blocktune is a script that parses scylla.yaml and tunes the data file
and commitlog directories it references.
Tuning includes:
- set the I/O scheduler to noop
- disable merging
- tune dependent devices (like RAID members)
Message-Id: <1476357027-15014-2-git-send-email-avi@scylladb.com>
The current chunk size of 256 gives a 50% probability of a 128k read or
write getting split into two accesses. This reduces efficiency and
increases latency.
Change the chunk size to 1MB, with a 12% probability of cross-member
access.
Message-Id: <1476269082-2473-1-git-send-email-avi@scylladb.com>
Since Debian 8(jessie) does not provides g++-5, we frequently got compile error
because we are using older compiler.
To fix the problem, backport g++-5 from Debian 9(stretch).
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-3-git-send-email-syuu@scylladb.com>
On Debian, lsb_release -r returns the version number something like '8.6'.
However, on this script we want to check major version only.
Therefore we can use VERSION_ID from /etc/os-release which only contains
major version number.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-2-git-send-email-syuu@scylladb.com>
Commit 7dcd70124a "tests/sstables: add
test for fast forwarding reader" added a test for skipping parts of
sstable. Unfortunately, it did not include the sstables it was trying to
read.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.
That is obviously something we must do, but the current approach to it
is problematic for two main reasons:
1) It forces the requests that trigger a write to wait on the current
write to finish. That is excessive; ideally we would wait for one
particular write to finish, not necessarily the current one. That
is made worse by the fact that when a write is followed by a flush
(happens when we move to a new segment), then we must wait for
*all* writes in that segment to finish.
1) it casts concurrency in terms of writes instead of memory, which
makes the aforementioned problem a lot worse: if we have very big
buffers in flight and we must wait for them to finish, that can
take a long time, often in the order of seconds, causing timeouts.
The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.
This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.
To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.
After 10 minutes running this load, we are left with the following
percentiles:
latency mean : 51.9 [WRITE:51.9]
latency median : 9.8 [WRITE:9.8]
latency 95th percentile : 125.6 [WRITE:125.6]
latency 99th percentile : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max : 2338.2 [WRITE:2338.2]
After this patch:
latency mean : 54.9 [WRITE:54.9]
latency median : 43.5 [WRITE:43.5]
latency 95th percentile : 126.9 [WRITE:126.9]
latency 99th percentile : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max : 471.4 [WRITE:471.4]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In a subsequent patch, I'll use this code in a different place. To
prepare for that, we move it out as a method. It also fits a lot better
inside the segment manager, so move it there.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Goal is to calculate a size that is lesser or equal than the
segment-dependent size.
This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"
Extracted here so it sits clearly in a different patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It is mostly an optimization, and while it makes sense in this context,
it won't soon as we'll stop waiting for the current cycle specifically
to finish.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We'll do that so we can, in following patches, use static members from
the segment. Those are not defined at this point.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We track the amount of pending allocations but we don't really export
it. It will be crucial when we stop tracking pending writes.
This patch exports it through a method instead of the totals structure,
so we can easily change it. Current code probing pending_allocations
(the api code) is also converted to use the public method instead of the
totals struct.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This patchset enables mutation readers to be fast forwarded to a different
partition range. The main reason for introducing such feature are range
queries served from cache. If the cache is partially populated in the
requested range the reader will end up with multiple subranges that have
to be read from the sstables. Originally, each of these subranges would
require a new reader to be created, but with fast forwarding we can have
just one sstable reader. This is better since there is a chance that buffers
kept by the reader may be still useful after fast forwarding it.
In this series there are also patches that clean up cache readers in order
to make integration with fast forwarding easier. Namely, continuity flag is
changed to store information about range before the entry which significantly
simplifies the logic.
Fixes #1299."
* 'pdziepak/fast-forward-mutation-readers/v5' of github.com:cloudius-systems/seastar-dev: (24 commits)
sstables: keep separate stream history for single and range reads
sstables: drop sstable::{lower, upper}_bound()
row_cache: rework cache to use fast forwarding reader
row_cache: put cache entry flags in a struct
row_cache: add do_find_or_create_entry() to reduce code duplication
mutation_reader: forward fast_forward_to() calls
tests/row_cache: add fast_forward_to() to throttled reader
tests/row_cache: count mutations read from _underlying
memtable: add support for fast_forward_to()
drop key readers
tests/mutation_reader: test fast forwarding combined reader
database: enable fast forwarding of range_sstable_reader
combined_mutation_reader: implement fast_forward_to()
mutation_reader: make combinded_reader public
tests/sstables: add test for fast forwarding reader
tests: add more helpers to mutation reader assertions
sstables: enable fast forwarding for range readers
mutation_reader: introduce fast_forward_to()
sstables: implement mutation_reader::impl::fast_forward_to()
sstables: introduce index_reader
...
Single partition and partition range reads are expected to behave
considerably different so it is worth to have them use separate file
stream history. This also makes reads use different history for each
sstable which is also a good thing.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This uncomfortably large patch overhauls cache range reader so that it
can take advantage of fast forwarding mutation readers.
A significant change in the cache itself is that the continuity flag now
is used to determine whether cache is contiguous between the previous
entry and the current one. This allows for a significant simplification
of the cache code and easier integration with reader fast forwarding.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Flags are easier to manage if they are in a single structure.
Especially, default initialization and move contstructors are simpler
and less error prone.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether
continuity flag is working correctly.
This is not going to work with fast forwarding mutation readers so the
test is switched to count number of mutations (+ end of stream markers)
returned from underlying mutaiton readers which is much less fragile.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Fast forwarding of memtable readers is needed only for unit tests which
often use memtables as underlying data source for cache and the cache
readers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When fast forwarding a reader that combines sstable reader we must also
remember that the set of sstables for the new range may be different
than for the previous one. The reader introduced in this patch makes
sure that we read from correct sstables.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
We want to be able to fast forward sstable readers. However, just
implementing fast_forward_to() for combined_reader is not enough as the
sstables we are reading from may need to change.
Following patches are going to introduce a combined sstable reader that
derives from combined_reader. To make that possible we first need to
make combined_reader public.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch introduces the interface for fast forwarding mutation
readers. The main user of this feature is going to be cache which, while
serving range query, may need to read multiple small ranges from the
sstables to populate itself with the missing entries.
Fast forwarding is an alternative to recreating a reader with different
range. Its main advantage is fact that it avoids dropping data that has
already been read.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch allows sstable readers to be fast forwarded without making it
necessary to recreate the reader (and dropping all buffers in the
process). It is built on top of index_reader and ability of
data_consume_context to be fast forwarded.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they still may be needed (for example to
get end bound of the range or after fast forwarding the reader).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
That overload was used only by unit test and violated guarantee that
partition range lives until mutation reader is done.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
The move constructor of partition_version was not invoking move
constructor of anchorless_list_base_hook. As a result, when
partition_version objects were moved, e.g. during LSA compaction, they
were unlinked from their lists.
This can make readers return invalid data, because not all versions
will be reachable.
It also casues leaks of the versions which are not directly attached
to memtable entry. This will trigger assertion failure in LSA region
destructor. This assetion triggers with row cache disabled. With cache
enabled (default) all segments are merged into the cache region, which
currently is not destroyed on shutdown, so this problem would go
unnoticed. With cache disabled, memtable region is destroyed after
memtable is flushed and after all readers stop using that memtable.
Fixes#1753.
Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>
This patch uses cf_properties instead to add the missing attributes to
the create_view_statement class.
Fixes#1766
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the VIEWS element to the cause enum so we can
mark failures due to incomplete support of materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the definition of the default compressor into the
compression_parameters class, so that the table and view creation
statements don't have to explicitly deal with it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the cf_properties class, which contains common
attributes of tables and materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since we are exiting Scylla process in engine().at_exit() using
::_exit(0), even verify_seastar_io_scheduler() throwing an exception,
scylla always exit with 0.
Systemd misunderstands scylla-server.service was shutdown successfully
because of this, so we need to pass correct exit code to ::_exit() here.
Fixes#1674
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>
* seastar 207bf3d...ccd8649 (3):
> Merge "Augment semaphore with non-blocking operations" from Glauber
> Merge "More dynamic fstream patches" from Paweł
> Merge "fstream: add dynamic adjustments based on stream history" from Paweł
A 1MB response will require 2000 allocations with the current 512-byte
chunk size. Increase it exponentially to reduce allocation count for
larger responses (still respecting the upper limit).
Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, it is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.
Change the accounting code to reuse existing partition_snapshot reference.
Fixes#1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.
This patch changes the LSA code so that it doesn't throw in case zone
couldn't be created but just returns a null pointer which should be
more performant if the LSA memory cannot grow any more.
Fixes#1394.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
The expected behaviour in the scylla_setup script is that a question
will be followed by the answer.
For example, after asking if the scylla should be run as a service the
relevant actions will be taken before the following question.
This patch address two such mis-orders:
1. the scylla-housekeeping depends on the scylla-server, but the
setup should first setup the scylla-server service and only then ask
(and install if needed) the scylla-housekeeping.
2. The node_exporter should be placed after the io_setup is done.
Fixes#1739
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>
Change abstract_replication_strategy::create_replication_strategy() to
throw exceptions::configuration_error if replication strategy class
lookup to make sure the error is converted to the correct CQL response.
Fixes#1755
Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>
* seastar f937fb0...207bf3d (11):
> Merge "iotune: gracefully exit on predictable exceptions" (Fixes#1623)
> core/semaphore: Add semaphore_units::release()
> Merge "rometheus API with grafana uses labels" from Amnon
> core/thread: Fix stack alloc-dealloc mismatch
> core/thread: Make jmp_buf_link::yield_at use the same time point as thread_scheduling_group
> file: support for XFS on older kernels
> reactor: fix bug when handling EBADF in flush_pending_aio()
> prometheus CPU should start in 0
> Collectd: bytes ordering depends on the type
> tests: Check that backtrace() doesn't corrupt signal mask
> core/thread: Add stack guards to seastar thread stacks
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.
Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.
Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.
Fixes#1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
"This series address two issues that interfere with running the node_exporter as a service in ubuntu 16.
1. The service file should be packed in the deb file
2. When setting the node_exporter as a service it doesn't need to run with scylla use"
* 'amnon/node_exporter_ubuntu_v2' of github.com:cloudius-systems/seastar-dev:
node-exporter service: No need to run as scylla user
debian package: Include the node_exporter service file
the node-exporter does not need to run as scylla user. It can run
without scylla or without the scylla user being configure.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"This patch-set re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.
Ref #1139
Ref #693"
* 'describe-splits/v2' of github.com:duarten/scylla:
thrift: Implement describe_splits_ex based on Cassandra
storage_service: Implement get_splits() function
sstables: Add function to get key samples
sstables/key: Add to_partition_key function
size_estimates_recorder: Increase estimate accuracy
sstables: Get estimates for a particular range
sstables/key: Make key::kind public
The script mistakenly split value at "," when cpuset list is separated
by comma. Instead of matching possible patterns of the argument, let's
pass all characters until reach to space delimiter or end of line.
Fixes#1716
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476171037-32373-1-git-send-email-syuu@scylladb.com>
This patch re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.
Ref #1139
Ref #693
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the get_splits() function in storage_service,
used to split a particular token range in slices of approximately the
specified size, using the sample keys and estimates of the CF's
sstables.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present between the specified range.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"The version is taken from the installation rather than the API, a mode command
line indicated that this is part of the setup and uuid is used for the
interaction with the checkversion server."
* 'amnon/check_version_on_startup_v3' of github.com:cloudius-systems/seastar-dev:
scylla_setup: Check and report the scylla version
scylla-housekeeping: check version during setup
There is already queue_length-requests_blocked_memory, but it's a
gauge so does not reflect what happened between the sampling points.
total_operations-requests_blocked_memory will allow to see if there
were any (and how many) requests which were blocked by dirty memory.
Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
Presents current heap profile recording.
Works in text mode or dumps to collapsed stacks format from which
flame graph can be generated.
To generate a flamegraph:
(gdb) scylla heapprof --flame
Wrote heapprof.stacks
$ flamegraph.pl --colors mem < heapprof.stacks > heapprof.svg
flamegraph.pl comes from:
https://github.com/brendangregg/FlameGraph.git
Text mode example:
(gdb) scylla heapprof --min 100000000
All (274699676, #10213)
\-- void* memory::cpu_pages::allocate_large_and_trim<memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}>(unsigned int, memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}) + 169 (268435456, #1)
memory::allocate_large_aligned(unsigned long, unsigned long) + 87
memory::allocate_aligned(unsigned long, unsigned long) + 48
aligned_alloc + 9
logalloc::segment_zone::segment_zone() + 304
logalloc::segment_pool::allocate_segment() + 477
logalloc::segment_pool::segment_pool() + 304
__tls_init.part.801 + 72
logalloc::region_group::release_requests() + 1333
logalloc::region_group::add(logalloc::region_group*) + 514
The branches are formatted like this:
-- <symbol> (<size>, #<count>)
Where <size> is total size of live objects and <count> is total
number of live objects, for all objects allocated from paths going
through this node.
Nodes which share the same <size> and <count> are stacked like this:
-- <symbol_1> (<size>, #<count>)
<symbol_2>
<symbol_3>
Message-Id: <1475583334-19524-1-git-send-email-tgrabiec@scylladb.com>
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write behind support. Now
that write behind is properly merged we can reduce the concurrency to
what it should be, one.
This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.
Fixes#1373
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.
This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.
The change sets no timeout for streaming readers at replica level,
similarly as we do for system tables readers.
Fixes#1741.
Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range. If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.
"before" and "after" are not well defined for wrap-around ranges, so
but we are phasing them out and soon there will not be
wrapping_range::split_after() users.
This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
Commit log replay is a synchronous operation in bootstrap, so services
will only be started after it's completed. By starting compaction before,
less bandwidth will be available to both and consequently boot will be
slowed down. Fix is simply about moving compaction, which is an
asynchronous operation after commitlog replay is over.
Fixes#1620.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>
This patch adds the parsing for the "CREATE MATERIALIZED VIEW" statement,
following Cassandra 3 syntax. For example:
CREATE MATERIALIZED VIEW building_by_city
AS SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);
It also adds the "IS NOT NULL" operator needed for this purpose.
As in Cassandra, "IS NOT NULL" can only be used for materialized
view creation, and not in a normal SELECT. It can only be used with
the NULL operand (i.e., "IS NOT 3" will be a syntax error).
The current implementation of this statement just does some sanity
checking (such as to verify that "city" is a valid column name and that
the "building" base table exists), complains that materialized views are
not yet supported:
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [CREATE MATERIALIZED VIEW building_by_city AS
SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);] reason: unsupported operation: Materialized views not yet supported">
As mentioned above, the "IS NOT NULL" restriction is not allowed in
ordinary selects not creating a materialized views:
SELECT * FROM buildings WHERE city IS NOT NULL;
InvalidRequest: code=2200 [Invalid query] message="restriction 'city IS NOT null' is only supported in materialized view creation"
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1475742927-30695-1-git-send-email-nyh@scylladb.com>
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to it.
Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.
If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.
One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.
The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.
Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
"Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that, is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Results
=======
With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.
In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.
For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ
Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed. I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have ran (large, 65k mutations). Memtables under this circumnstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.
Known Issues:
-------------
A lot of time can be elapsed between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.
After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."
* 'virtual-dirty-v6' of github.com:glommer/scylla:
database: allow virtual dirty memory management
streamed_mutation: make _buffer private
add accounting of memory read to partition_snapshot_reader
move partition_snapshot_reader code to header file
LSA: allow a group to query its own region group
memtables: split scanning reader in two
sstables: use special reader for writing a memtable
LSA: export information about object memory footprint
LSA: export information about size of the throttle queue
database: export virtual dirty bytes region group
* seastar 18f7bb8...f937fb0 (5):
> Merge "Fix signal mask corruption" from Tomasz
> core/memory: Avoid violating strict aliasing when accessing allocation sites
> core/memory: Avoid indirection when storing allocation sites
> core/memory: Add a way to disable abort on allocation failure in some scope
> core/sharded: Allow mapper to take the service by non-const reference
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It is currently protected, but now all users go through
push_mutation_fragment(). So we can safely move its visibility to guarantee
that it stays that way.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
By default, we don't do any accounting. By specializing this class and providing
an accounter class, we can account how much memory are we reading as we read
through the elements.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is so we can template it without worrying about declaring the
specializations in the .cc file.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The code that is common will live in its own reader, the iterator_reader. All
friendly private access to memtable attributes and methods happen through the
iterator reader.
After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the special reader doesn't do much, but the idea is that we will
soon replace it will a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We allocate objects of a certain size, but we use a bit more memory to hold
them. To get a clerer picture about how much memory will an object cost us, we
need help from the allocator. This patch exports an interface that allow users
to query into a specific allocator to get that information.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In order to allow Scylla’s docker container to handle multiple network
interfaces, the start-scylla script was refactored:
- `$IP` is now called `$SCYLLA_LISTEN_ADDRESS`, so it is less likely to
be confused or interfere with other environment variables.
- `$SCYLLA_LISTEN_ADDRESS` now checks its value and also tries to
resolve a hostname, if no IP was set to it.
- `$SCYLLA_LISTEN_DEVICE` can now be set as environment variable and
contain any available NIC device name (e.g. `eth0`). The script
automatically retrieves the IP address from the device.
Usage:
1. With `$SCYLLA_LISTEN_ADDRESS` as IP:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=192.168.1.100 scylladb/scylla`
2. With `$SCYLLA_LISTEN_ADDRESS` as hostname:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=containername.network.lan scylladb/scylla`
3. With `$SCYLLA_LISTEN_DEVICE`:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_DEVICE=eth0 scylladb/scylla`
Message-Id: <20161003151230.67672-1-marius@twostairs.com>
Currently, the code responsible for calculating ranges for the next
request could produce a wrap-around partition range. For example, if the
original range was (unimportant, A] and the last partition key A then
the output range would be (A, A].
This patch adds checks to make sure that in such cases the range is
removed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475497244-2790-1-git-send-email-pdziepak@scylladb.com>
cassandra_exception::prepare_message() is called from derived classes'
constructors before the base cassnadra_exception object is constructed.
This is technically illegal but harmless. Fix by marking the function
static.
Found by clang.
"This patch set ensures we can correctly handle queries
where the minimum token is specified."
* 'min-token/v3' of github.com:duarten/scylla:
cql_query_test: Add test case for min/max token bounds
token_restriction: Deal with minimum tokens
partitioner: Parse token from bytes
This object, similarly to a global_schema_ptr, allows to dynamically
create the trace_state_ptr objects on different shards in a context
of the original tracing session.
This object would create a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.
Fixes#1678Fixes#1647
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
This patch adds a call to the scylla-housekeeping check version during
setup, so a warning will be printed if a newer version is available.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This changes are for running scylla during setup.
It contains the following changes:
1. get the current version from the command line (as the syclla does not
run at this stage).
2. It support a mode parameter in the command line to indicate that we
running during the installation.
3. It accept an external uuid that will be used with all interaction
with the check_version server.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When max sstable size is increased, higher levels are suffering from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)
Fixes#1720.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing sstable with lowest first key, when compacting
a level > 0. This resulted in performance problem because L1->L2 may have a
huge overlap over time, for example.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.
Fixes#1719.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Paging code assumes that clustering row range [a, a] contains only one
row which may not be true. Another problem is that it tries to use
range<> interface for dealing with clustering key ranges which doesn't
work because of the lack of correct comparator.
Refs #1446.
Fixes#1684.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for
request's memory only around reading of its frame from the connection
and not actual request execution. As a result too many requests may be
allowed to execute and we may run out of memory.
Fixes#1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
This patch fixes a bug where queries such as the following are not
handled properly:
"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"
Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.
Ref #1139
Ref #693Fixes#1717
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is parse a particular token and explicitly handle the
case when the minimum token is specified.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar 5b7252d...9e1d5db (5):
> prometheus: prevent illegal prometheus names
> scollectd: raw_to_value should not use network order
> semaphore: Introduce get_units()
> core::scollectd: truncate the identifiers fields on a 63 characters boundary
> Merge "Fix ASAN errors in debug builds" from Tomasz
Before: the range is split only once, so it is split into 2 sub ranges
INFO 2016-09-29 15:52:43,625 [shard 0] repair - target_partitions=100, estimated_partitions=537, ranges.size=2,
range=(8993553141924659802, 8997061146192366917] ->
ranges={
(8993553141924659802, 8995307144058513359], (8995307144058513359, 8997061146192366917]}
After: the range is split mulitple times, resulting 16 sub ranges.
INFO 2016-09-29 15:55:07,934 [shard 0] repair - target_partitions=100, estimated_partitions=67, ranges.size=16,
range=(8993553141924659802, 8997061146192366917] ->
ranges={
(8993553141924659802, 8993772392191391496], (8993772392191391496, 8993991642458123191],
(8993991642458123191, 8994210892724854885], (8994210892724854885, 8994430142991586580],
(8994430142991586580, 8994649393258318274], (8994649393258318274, 8994868643525049969],
(8994868643525049969, 8995087893791781664], (8995087893791781664, 8995307144058513359],
(8995307144058513359, 8995526394325245053], (8995526394325245053, 8995745644591976748],
(8995745644591976748, 8995964894858708443], (8995964894858708443, 8996184145125440138],
(8996184145125440138, 8996403395392171832], (8996403395392171832, 8996622645658903527],
(8996622645658903527, 8996841895925635222], (8996841895925635222, 8997061146192366917]}
Without this patch, repair can do checksum with a range with a lot of
partitions, not the expected less than 100 partitions per checksum. This
can lead to unncessary data transfer since the checksum is too coarse.
For instacne, as above, if the checksum of 1 out of 537 partitions is
different, the whole 527 partitions will be synced.
Fixes#1613
Message-Id: <0775c20c485c105df5f10bd685048227f074c365.1475137029.git.asias@scylladb.com>
The snappy_compress() function expects the "compressed_length" parameter
to contain the actual output buffer length but now we're passing random
garbage from the stack.
Fixes#1711
Message-Id: <1475132127-316-1-git-send-email-penberg@scylladb.com>
* seastar 2b55789...5b7252d (3):
> Merge "rpc: serialize large messages into fragmented memory" from Gleb
> Merge "Print backtrace on SIGSEGV and SIGABRT" from Tomasz
> test_runner: avoid nested optionals
Includes patch from Gleb to adapt to seastar changes.
"This series improves repair by
1) using less streaming sessions
2) reducing unnecessary streaming traffic
3) fixing a hang during shutdown
See commit log for "repair: Reduce stream_plan usage", "repair: Reduce
unnecessary streaming traffic" and "streaming: Fail streaming sessions
during shutdown" for details.
Tested with repair_additional_test.py."
Also add information about for how long has the oldest been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.
This counter is named for consistency after similar counters from transport/.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".
We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Also make sure to not listen on the same exact address twice in case
listen_address == broadcast_address. Scylla configuration code does not
allow such thing to be configured, but better to be safe.
Message-Id: <20160927102316.GO32178@scylladb.com>
Print when the node will be removed from gossip membership, e.g.,
INFO 2016-09-27 08:54:49,262 [shard 0] gossip - Node 127.0.0.3 will be
removed from gossip at [2016-09-30 08:54:48]: (expire = 1475196888294489339,
now = 1474937689262295270, diff = 259199 seconds)
The expire time which is used to decide when to remove a node from
gossip membership is gossiped around the cluster. We switched to steady
clock in the past. In order to have a consistent time_point in all the
nodes in the cluster, we have to use wall clock. Switch to use
system_clock for gossip.
Fixes#1704
"The prometheus project and its sub project does not have RPM/DEB packaging yet,
but it does have binaries for download.
This series adds an installation script that download install and run as a
service the node_exporter. For os that uses systemd it has a spec file ready
that will be package with the system. For ubuntu a service file will be created
when running the installer.
After this series running node_exporter_install a node_exporter will be running
as a service on the machine."
In patch ac619820 (streaming: Switch to use make_streaming_reade), we
switched to use make_streaming_reader for streaming. In repair, the
checksum phases also uses a mutation reader. For the same reasons (no
pollution to row cache, bounded new data after the reader is created),
switch repair checksum calculation to use the make_streaming_reader too.
Fixes#382Fixes#1682
Message-Id: <9e0ecda861bb0b6f690da5e2378b208159ffa41c.1474933195.git.asias@scylladb.com>
From Asias:
With this series, streaming and repair are improved:
- streaming, repair will not pollute the row cache on the sender side
any more. Currently, we are risking evicting all the frequently-queried
partitions from the cache when an operation like repair reads entire
sstables and floods the row cache with swathes of cold data from they
read from disk.
- less data will be sent becasue the reader will only return existing
data before the point of the reader is created, plus bounded amount
of writes which arrive later. This helps reducing the streaming time
in the case new data is being inserted all the time while streaming is
in progress. E.g., adding a new node while there is a lot of cql write
workload.
Fixes#382 and #1682
Using make_streaming_reader for streaming on the sender side, it has
the following advantages:
- streaming, repair will not pollute the row cache on the sender side
any more. Currently, we are risking evicting all the frequently-queried
partitions from the cache when an operation like repair reads entire
sstables and floods the row cache with swathes of cold data from they
read from disk.
- less data will be sent becasue the reader will only return existing
data before the point of the reader is created, plus bounded amount
of writes which arrive later. This helps reducing the streaming time
in the case new data is being inserted all the time while streaming is
in progress. E.g., adding a new node while there is a lot of cql write
workload.
Fixes#382Fixes#1682
The make_streaming_reader returns a combined mutation reader reads
mutations from sstables and memtable. The memtable reader handles
memtable flushing automatically so no special handling is needed here.
It will be used by streaming soon.
When there's no free disk, it asks to select disks from empty list:
"Please select disks from following list:
type 'done' to finish selection. selected:"
We should avoid to ask it, abort RAID setup instead.
Fixes#1673
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474429218-28382-1-git-send-email-syuu@scylladb.com>
When using multiple physical network interfaces, set this to true to
listen on broadcast_address in addition to the listen_address, allowing
nodes to communicate in both interfaces. Ignore this property if the
network configuration automatically routes between the public and
private networks such as EC2.
Message-Id: <20160921094810.GA28654@scylladb.com>
If the remote peers have the same checksum, we can only fetch from
one of the peer node instead of all of them since they all have the same
data anyway. No need to fetch from all of them.
In addition to above optimization, if the local peer has no data, we can
skip sending the data back to the remote peer. Due to the fact that all
the remote peers have the same checksum and local peer has no data, so
each and every remote peer has all the data. There is no need to merge
the remote data with local data and send back the merged data back to
remote peers.
Refs: #1617
Right now, we are using one stream_plan for each range of a column
family. This generates tons of stream_plans and stream_sessions. Each
stream_plan can transfer multiple ranges and column families. We can
use a single stream_plan to stream datas for multiple ranges and column
families, so that 1) overhead of stream_plan/session negotiation is
reduced 2) it is much easier to debug/monitor few stream_sessions
Fixes#1685
"When a node is decommissioned, its gossip state will not be removed from gossip
immediately. It will only be removed 3 days later which helps nodes that were
down when the node was decommissioned to know decommission later when they are
up again.
This series improves the logging to reduce confusion when a node tries to
talking to a decommissioned node. In addition, we now do not try to talk to the
decommissioned in the unreachable_endpoints gossip round.
Fixes#1615"
* tag 'asias/loggging_decommissioned_nodes/v1' of github.com:cloudius-systems/seastar-dev:
gossip: Make two log items debug level
gossip: Print node status when node is UP or DOWN
gossip: Ignore the node which is decommissioned in gossip round
gossip: Print convict debug info only when the node is alive
gossip: Add more timing log in add_expire_time_for_endpoint
streaming: Print on_remove and on_restart log when peer exists
streaming: Introduce has_peer in stream_manager
It is duplciated with "InetAddresss x.x.x.x is now UP" message.
INFO 2016-09-23 10:35:15,512 [shard 0] gossip - Node 127.0.0.1 has restarted, now UP, status = NORMAL
INFO 2016-09-23 10:35:15,513 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL
Make the log a bit cleaner.
For example:
gossip - InetAddress 127.0.0.4 is now UP, status = NORMAL
gossip - InetAddress 127.0.0.3 is now DOWN, status = LEFT
gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
We print the following messages even if there is no stream_session with
that peer. It is a bit confusing.
INFO 2016-09-23 08:26:37,254 [shard 0] stream_session - stream_manager:
Close all stream_session with peer = 127.0.0.1 in on_restart
INFO 2016-09-23 08:26:37,287 [shard 0] stream_session - stream_manager:
Close all stream_session with peer = 127.0.0.3 in on_remove
Print only when the streaming session with the peer exists.
The fact that Seastar's semaphore has a default initializer of 1 if not
explicitly initialized is confusing and unexpected and recently lead to
two bugs. So ScyllaDB should not rely on this default behavior, and specify
the initial value of each semaphore explicitly.
In several cases in the ScyllaDB code, the explict initialization was
missing, and this patch adds it. In one case (rate_limiter) I even think
the default of 1 was a bit strange, and 0 makes more sense.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1474530745-23951-1-git-send-email-nyh@scylladb.com>
stdx::optional<T> uses quite elaborate std::enable_if_t magic to decide
whether the argument passed to its constructor should be used for a call
T constructor or stdx::optional<T> constructor.
Apparently, with GCC 6.2 having T constructor which accepts any type
confuses that magic and we end up with compile errors.
The solution is to have from_range() method that replaces that
constructor from range. There is also constructor that creates a key
from std::vector<bytes> so that code generated by IDL works as it did
before.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1474550971-15309-1-git-send-email-pdziepak@scylladb.com>
This adds the option to install node_exporter during setup.
The node_exporter export server information in the prometheus API.
It should be used when using the scylla prometheus API to get the server
information.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a service file for OS that supports systemd.
When started, it would run an already installed node_exporter or fail.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
node_exporter is a utility that export node information via prometheus
API. It takes care of host related metrics such as CPU and memory.
The install script, download the node_exporter binaries, create a link
in /usr/bin.
On OS with systemd supported it would enable and start the installed
service file to start as a service. On others (ubuntu) it would create a conf file and start it.
The installation should be done using sudo.
After a successful installation, the node_exporter would run as a
service.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
It is possible that endpoint_state_map does not contain the entry for
the node itself when collectd accesses it.
Fixes the issue:
Sep 18 11:33:16 XXX scylla[19483]: [shard 0] seastar - Exceptional
future ignored: std::out_of_range (_Map_base::at)
Fixes#1656
Message-Id: <8ffe22a542ff71e8c121b06ad62f94db54cc388f.1474377722.git.asias@scylladb.com>
Example:
(gdb) scylla ptr 0x601000480000
thread 1, large, LSA-managed
One can then use 'scylla lsa-segment 0x601000480000' to examine LSA
segment contents.
Benoît Canet points out that CQL messages are not always compressed
although compression is enabled by the driver. Turns out our CQL
compression negotiation is broken. We need to negotiate compression upon
STARTUP message and not rely on the incoming request to have the
compression bit enabled.
Fixes#1680
Message-Id: <1474366693-3001-1-git-send-email-penberg@scylladb.com>
The constructor was added in commit 7f3ce39 ("query_options: Add
constructor for batch mode options (multi-level)") but apparently it was
never actually implemented.
Spotted by CLion.
Message-Id: <1474303017-23383-1-git-send-email-penberg@scylladb.com>
The EXECUTE message encoding is different between CQL binary protocol
versions v1 and v2 (and later). Fix process_execute() to deserialize the
message as per the CQL binary protocol v1 specification:
Executes a prepared query. The body of the message must be:
<id><n><value_1>....<value_n><consistency>
where:
- <id> is the prepared query ID. It's the [short bytes] returned as a
response to a PREPARE message.
- <n> is a [short] indicating the number of following values.
- <value_1>...<value_n> are the [bytes] to use for bound variables in the
prepared query.
- <consistency> is the [consistency] level for the operation.
Fixes#1676
Message-Id: <1474287392-16792-1-git-send-email-penberg@scylladb.com>
leveled strategy uses heavily first and last decorated keys of a
sstable to get overlapping sstables in a given level. By storing
first and last decorated keys in sstable object, it's expected
that performance of leveled strategy (not compaction) will be
improved.
We will set first and last keys in sstable when either loading
or sealing it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0abca819454ab4c088541bb49714f1f6a7dc4f42.1473959677.git.raphaelsc@scylladb.com>
timeuuid_type_impl::compare_bytes is a "trichotomic" comparator (-1,
0, 1) while less() is a "less" comparator (false, true). The code
incorrectly returns c1 instead of c1 < 0 which breaks the ordering.
Fixes#1196.
Message-Id: <1473956716-5209-1-git-send-email-tgrabiec@scylladb.com>
On instances differenet then i2/m3/c3 we provide instructions to run
scylla_ip_setup. Running scylla_io_setup requires access to
/var/lib/scylla to crate a temporary file. To gain access to that
directory the user should run 'sudo scylla_io_setup'.
refs: #1645
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <4ce90ca1ba4da8f07cf8aa15e755675463a22933.1473935778.git.shlomi@scylladb.com>
* seastar 0303e0c...e534401 (6):
> Merge "enable rpc to work on non contiguous memory for receive" from Gleb
> install-dependencies.sh: install python3 for Ubuntu/Debian, which requires for configure.py
> fix tcp stuck when output_stream write more than 212992 bytes once.
> scripts/posix_net_conf.sh: supress 'ls: cannot access /sys/class/net/<NIC>/device/msi_irqs/' error message
> scripts/posix_net_conf.sh: fix 'command not found' error when specifies --cpu-mask
> native_network_stack: Fix use after free/missing wait in dhcp
Includes: "Remove utils::fragmented_input_stream and utils::input_stream in favor of seastar version" from Gleb.
From Duarte:
This patchset reuses the bound_view::comparator in range_tombstone to
correctly detect wrap around of a clustering range. This fixes a
manifestation of #1446 that results in wrong query results.
Introduced by b1f9688432Fixes#1669
Refs #1446
In row_cache::make_reader, we update statistics inside an
allocating_section, which retries the supplied function until it can
satisfy all allocations by way of reserving LSA memory up front. Since
those updates are interleave with allocations, retries can lead to
miscounts.
This patch fixes this by updating statistics after all allocations.
Fixes#1659
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1473845977-20205-1-git-send-email-duarte@scylladb.com>
Currently we get boost::lexical_cast on startup if inital_token has a
list which contains spaces after commas, e.g.:
initial_token: -1100081313741479381, -1104041856484663086, ...
Fixes#1664.
Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>
There are several places in types.cc where we assume that sstring_view
range is null terminated. That may be not true and we should always use
either begin()/end() or data()/size() pairs.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This will be very important for read performance of time series use case,
where timestamp is usually stored as a clustering key, and the user asks
for specific data using a clustering range filter. Example:
CREATE TABLE temperature (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time)
);
...
SELECT * FROM temperature
WHERE weatherstation_id='1234ABCD'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';
This is based on: https://issues.apache.org/jira/browse/CASSANDRA-5514
To check correctness, I wrote a dtest that runs scylla with row cache disabled,
creates several sstables with non overlapping clustering key ranges, queries
data using several clustering range filters, and checks that the database
returns the expected results.
Tested performance with a tool I wrote myself [1] and performance is indeed
improved by this patchset. This tool works as follow:
Scylla is started with row cache disabled. That's wanted here because we're
measuring a specific code that only gets executed if row cache misses the data
we asked for. Then Scylla is populated node with N sstables ('nodetool flush'
is used to ensure it), where each will have M clustering keys, totaling N*M
clustering keys. Finally, we will start asking for data using a clustering
range filter. The tool measures throughput and min/max/avg latency.
[1]: https://gist.github.com/raphaelsc/4c415f592aaed14a18be31279d225972
Follow the results:
BEFORE
-----
('Clustering keys / second: ', 747.9672111659951)
('Max latency (ms): ', 33)
('Min latency (ms): ', 12)
('Avg latency (ms): ', 13.0)
The operation took 13.3695700169 seconds
AFTER
-----
('Clustering keys / second: ', 3159.115303945648)
('Max latency (ms): ', 22)
('Min latency (ms): ', 2)
('Avg latency (ms): ', 3.0)
The operation took 3.16544318199 seconds
NOTE: Throughput and average latency are improved by a factor of ~4.
-----"
This adds the GET and POST api for slow query logging.
The GET return an object with the enable, ttl and threshold and the POST
lets you configure each of them.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
That's needed to observe behavior of clustering filter, and to
check if it's worthwhile for a specific workload.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Leveled strategy will not benefit from this strategy because
there's only a few sstables that will contain a given partition
key, which means that a clustering key that belongs to a specific
partition key can only be in a few sstables as well.
Date tiered strategy is the one that will actually benefit the
most from this optimization. Size tiered may benefit from it too
if clustering key isn't overwritten, but it will not use the
clustering optimization.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If user specifies a clustering filter, it's possible to filter out
sstable based on its metadata that tracks min/max clustering value.
For example, if sstable stores clustering key from 'a' through 'c',
it's possible to filter out that sstable if user asks for data
with clustering key greater than 'c'.
That's done by comparing each component separately because
clustering key may be composite. Further information can be found
here: https://issues.apache.org/jira/browse/CASSANDRA-5514
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be important for sstable code that will rule out a sstable
if it doesn't cover a given clustering key range.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's akin to abstract_type::as_less_comparator's nature.
So we don't have to repeat something like the following everywhere:
auto cmp = [&type] (const bytes_view& b1, const bytes_view& b2) {
return type->compare(b1, b2); }
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All sstables will now have bloom filter checked in a single pass
before reader iterate through all candidates. It's possible that
we will need to futurize the procedure if it holds cpu for too
long. This change is also a step towards the optimization that
will rule out sstables based on clustering filter.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Store range for each clustering component in sstable itself to
optimize sstable filtering based on clustering key.
If schema defines no clustering key, this new field will be
empty. Each range stores min and max value of that specific
component. With this information, it's possible to know if a
sstable possibly stores a given clustering component.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Scylla was generating a sstable with incorrect min max clustering
values. This information is used to filter out a sstable when user
asks for a range of clustering rows. So it's important to detect
wrong metadata and make sure that it will not be used.
The validation is fast and will only happen when loading a sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be needed for optimization that will store decorated keys
in the sstable object, and also for a subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As schema is stored in sstable, there's no longer
a need to store ks and cf names in it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It's possible to copy sstables directly into vector, and that will
improve performance. my benchmark tool[1] shows that new version
reduces running time of *copy procedure* by factor of two after
1024^2 calls.
Switching to back_inserter improves throughput even further.
[1]: gist.github.com/raphaelsc/a4b27290f362cdecdef399770dda759c
Refs #1632.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <7153514a9b5f5eb24dff518ee9fa3680e0881dae.1472741401.git.raphaelsc@scylladb.com>
This reverts commit 1726b1d0cc.
Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.
Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.
The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format in the other side.
Conflicts:
row_cache.hh
Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scy
lladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch makes the optional trace_state_ptr arguments introduced in
previous patches mandatory where possible. Functions which are called
internally don't have a trace context, so for those we keep the
argument's default value for convenience.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the storage_proxy so it passed along a
trace_state_ptr to the layers below, when querying locally or
receiving a remote query request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the database and column_family types so a
trace_state_ptr can be passed in when querying. This enables tracing
of the inner components.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the row_cache so it accepts a trace_state_ptr,
which it is responsible of flowing to the underlying mutation_reader
if needed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the mutation_reader so it optionally accepts a
trace_state_ptr. This will allow us to trace, for example, which
sstables are accessed during a request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Nodetool cfhistograms is supposed to tell us how many SSTables were touched per
read. Currently, we are a bit in the dark as we don't export that information.
This patch exports that, so that we can start using it."
If we have a cache hit, we still need to update our sstable histogram - notting
that we have touched 0 SSTables.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
That is done for single partition queries only - mimicking what
Cassandra does on that matter.
For this to be correct, we also need to update this histogram on cache
hits - in which case we update the read as having touched 0 SSTables. That
will be done on a separate patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The make_reader method is currently a const method, but we would like to start
keeping hit statistics from it.
Instead of relaxing the const condition too much, we can just mark the _stats
field as mutable, indicating that make_reader will not be able to change
anything in the CF, except for keeping statistics.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There is nothing really that fundamentally ties the estimated histogram to
sstables. This patch gets rid of the few incidental ties. They are:
- the namespace name, which is now moved to utils. Users inside sstables/
now need to add a namespace prefix, while the ones outside have to change
it to the right one
- sstables::merge, which has a very non-descriptive name to begin with, is
changed to a more descriptive name that can live inside utils/
- the disk_types.hh include has to be removed - but it had no reason to be
here in the first place.
Todo, is to actually move the file outside sstables/. That is done in a separate
step for clarity.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Sometimes the user would like to dump all the metrics into a file or
pipe it to another program, as requested in issue #1506.
This patch makes scyllatop check if stdout is connected to a TTY,
and if not - it does not fire up the fancy urwid UI but instead, just
writes all it's collected metrics to stdout.
Optionally, the user tell the program to quit after a specific
number of iterations via the -n or --iterations flag
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1471777516-9903-1-git-send-email-yoav@scylladb.com>
"clustering_key_filtering_context is no longer needed.
partition_slice can be used instead so this series removes
clustering_key_filtering_context and passes partition_slice down where
it's needed. Then a static get_ranges method is used to obtain
clustering key ranges for a given partition.
Fixes #1614."
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This fixes the problem of multiple concurrent get_ranges calls.
Previously each call was invalidating the result of the previous
call. Now they don't step on each other foot.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
On posix_net_conf.sh's single queue NIC mode (which means RPS enabled mode), we are excluded cpu0 and it's sibling from network stack processing cpus, and assigned NIC IRQ to cpu0.
So always network stack is not working on cpu0 and it's sibling, to get better performance we need to exclude these cpus from scylla too.
To do this, we need to get RPS cpu mask from posix_net_conf.sh, pass it to scylla_cpuset_setup to construct /etc/scylla.d/cpuset.conf when scylla_setup executed.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-2-git-send-email-syuu@scylladb.com>
Right now scylla_prepare specifies -mq option to posix_net_conf.sh when number of RX queues > 1, but on posix_net_conf.sh it sets NIC mode to sq when queues < ncpus / 2.
So the logic is different, and actually posix_net_conf.sh does not need to specify -sq/-mq now, it autodetects queue mode.
So we need to drop detection logic from scylla_prepare, let posix_net_conf.sh to detect it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-1-git-send-email-syuu@scylladb.com>
The CQL type IDs are specified as hex in the CQL binary protocol
specification. Define CQL type IDs in the code explicitly to make
reviewing the code and adding new types easier.
Message-Id: <1472537971-26053-1-git-send-email-penberg@scylladb.com>
Alexandr Porunov reports that Scylla fails to start up after reboot as follows:
Aug 25 19:44:51 scylla1 scylla[637]: Exiting on unhandled exception of type 'std::system_error': Error system:99 (Cannot assign requested address)
The problem is that because there's no dependency to network service,
Scylla simply attempts to start up too soon in the boot sequence and
fails.
Fixes#1618.
Message-Id: <1472212447-21445-1-git-send-email-penberg@scylladb.com>
"This series introduces a "slow query logging" feature that
allows logging the queries that take more than a specified
threshold time to complete.
Once such a query detected, it will be logged in a system_traces.node_slow_log table.
In addition all trace for that query that have been collected on a Coordinator
are going to be written as well.
If the handling time on a replica in the context of a query takes more than (the same) threshold
they are going to be written too.
The raw in a node_slow_log contains a session_id of a corresponding tracing session,
thereby allowing the user to query the system_traces tables for the corresponding trace
records.
The schema of the node_slow_log table is as follows:
CREATE TABLE system_traces.node_slow_log (
node_ip inet,
shard int,
session_id uuid,
date timestamp,
start_time timeuuid,
command text,
duration int,
parameters map<text, text>,
source_ip inet,
table_names set<text>,
username text,
PRIMARY KEY (start_time, node_ip, shard))
WITH default_time_to_live = 86400
where
- node_ip: IP of the coordinator Node.
- shard: shard ID on a Coordinator where the query was handled.
- session_id: ID of a corresponding tracing session.
- date: a time when the query has began.
- start_time: a time-based UUID for this query (needed for a primary key mostly).
- command: a query string.
- duration: a time it took to handle this query (in microseconds).
- parameters: a map of query parameters (like in system_traces.sessions).
- source_ip: IP of a Client that sent this query.
- table_names: a set of "<keyspace>.<table name>" strings representing column
families used in this query.
- username: a user name used for this query.
The good thing is that most of the data we needed is already
collected by the regular tracing framework. The only missing ones
are a username and tables' names. So, this series makes the framework collect them too.
The whole feature is integrated in the Tracing framework. The main
changes to the framework that were made are as follows:
- Store the constant capabilities of the tracing session in an enum_set, e.g.:
- primary/secondary.
- write on close.
- Introduce two new capabilities to a tracing session of a specific query:
- full tracing: collect all traces for this query (as it is before this series).
- log slow query: log this query if its duration is above the threshold.
These two capabilities may be defined independently.
- Add the logic that handles the "log slow query"-only case:
- Build the parameters<sstring, sstring> map only if the "duration" is above
the given threshold.
- The same about writing the trace entries.
- In a not-only "log slow query" case:
- Write the node_slow_log entry.
- Extend the trace_info struct to pass slow query threshold and TTL to the replica
Node.
In addition to above this series add the capability to configure the slow query logging
threshold and a TTL for the node_slow_log records.
The heaviest patch in the series is the last one. The series contains a few cosmetic (renaming)
patches that are meant to align the naming of the existing methods with the ones the last one
is going to add."
Now that mutation handler knows how much time is left for mutation
write to be handled it can use this knowledge to set correct timeout
for forwarded mutations.
Message-Id: <20160828080637.GE9243@scylladb.com>
Write-behind allows a single sstable write to saturate the disk,
improving throughput. Later we can take advantage of this to reduce
the number of sstables being written concurrently.
The main idea is to log queries that take "too long" to complete.
The "too long" is above the given threshold.
To achieve the above this patch does the following:
- Introduce two new properties to the tracing::trace_state:
- "Full tracing": when the tracing of this query was explicitly requested.
In this state we will record all possible traces related to this query:
both on the coordinator and on any replica involved.
- "Log slow query": when slow query logging is enabled.
If slow query logging is enabled and a session's "duration" is above
the specified threshold we will create a record in the "slow queries log"
and write all trace records created on the coordinator and on a replica
if a replica's session lasts longer than that threshold.
(We will propagate the Coordinator's slow query logging threshold to replicas
in the context of a specific tracing/logging session).
The properties above are independent, namely they may be enabled and/or disabled
independently and any combination of them is legal (naturally, creating a tracing
session when both states above are disabled makes no sense).
- Instrument the tracing::tracing service to allow the following:
- Enable/disable slow query logging.
- Set/get the slow query duration threshold (in microseconds).
- Set/get the slow query log record TTL value (in seconds).
- Instrument the trace_keyspace_helper to write a slow query log entry
when requested.
- The slow query logging is disabled by default and the threshold is set to half a second.
- The TTL of a slow log record is set to 86400 seconds by default.
- It makes sense to use the same "slow query logging threshold" and a "slow query record TTL"
both on a coordinator and on a replica Nodes in a context of the same tracing session:
- Pass both TTL and a threshold to the replica in a trace_info.
This patch also implements the new slow query logging specific logic:
- Don't write the pending tracing records before the end of a tracing session
until "duration" reaches the logging threshold.
- Don't build the parameters<sstring, sstring> map unless we know we will write it
to I/O.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
* seastar 6fadd98...ef063c5 (2):
> rpc: pass a timeout to a verb's server handler if the one was specified by a client
> rpc: cleanup the old metaprogramming craft
Reversed iterators are adaptors for 'normal' iterators. These underlying
iterators point to different objects that the reversed iterators
themselves.
The consequence of this is that removing an element pointed to by a
reversed iterator may invalidate reversed iterator which point to a
completely different object.
This is what happens in trim_rows for reversed queries. Erasing a row
can invalidate end iterator and the loop would fail to stop.
The solution is to introduce
reversal_traits::erase_dispose_and_update_end() funcion which erases and
disposes object pointed to by a given iterator but takes also a
reference to and end iterator and updates it if necessary to make sure
that it stays valid.
Fixes#1609.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>
It's always true and clustering_key_filtering_context is
going away so the first step is to get rid of this method.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We have API for getting pending compaction tasks both in column
family and compaction manager. Column family is already returning
pending tasks properly.
Compaction manager's one is used by 'nodetool compactionstats', and
was returning a value which doesn't reflect pending compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <a20b88938ad39e95f98bfd7f93e4d1666d1c6f95.1471641211.git.raphaelsc@scylladb.com>
- Instead of keeping separate booleans introduce a trace_state_props_set enum_set and
pass it around instead of separate booleans.
- Change the trace_info to hold this value in addition to write_on_close. Initialize
a corresponding bit in an enum_set based on a write_on_close value in a trace_info
constructor for a backward compatibility.
- Separate a trace_state constructor into two:
- For a primary session object.
- For a secondary session object.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Normally, the check version should start and stop with the scylla-server
service.
If it fails to find scylla server, there is no need to check the
version, nor to report it, so it can stop silently.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
There is a problem with Python SSL's in Ubuntu 14.04:
ubuntu@ip-10-81-165-156:~$ /usr/lib/scylla/scylla-housekeeping -q version
Traceback (most recent call last):
File "/usr/lib/scylla/scylla-housekeeping", line 94, in <module>
args.func(args)
File "/usr/lib/scylla/scylla-housekeeping", line 71, in check_version
latest_version = get_json_from_url(version_url + "?version=" + current_version)["version"]
File "/usr/lib/scylla/scylla-housekeeping", line 50, in get_json_from_url
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure>
Instead of using Python libraries to connect to the check version
server, we will use curl for that.
Fixes#1600
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
To work around an SSL problem with Python on Ubuntu 14.04, we need to
use curl. Add it as a dependency so that it's available on the host.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This table is going to be used to store information about queries
which are slower than a specified threshold.
Also added a column caching and mutation creation functions
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- "username" is a name used in the authentication process.
- "table name" is a <keyspace>.<cf name> string representing a name
of a table used for a query in question.
Note that there may be more than one table name in a batch query. Therefore
we store an unordered set of tables names.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Avoid sorting (and creating a new one) container at a backend code when a sorted
container is needed.
The overhead for the backends where it's not needed is minimal since the size of the
map is very small.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
.in is the name for template files witch requires to rewrite on building time, but these systemd unit files does not require rewrite, so don't name .in, reference directly from .spec.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1471607533-3821-1-git-send-email-syuu@scylladb.com>
Size estimates for a particular column family are recorded every 5
minutes. However, when a user calls the describe_splits(_ex) verbs,
they may want to see estimates for a recently created and updated
column family; this is legitimate and common in testing. However, a
client may also call describe_splits(_ex) very frequently and
recording the estimates on every call is wasteful and, worse, can
cause clients to give up. This patch fixes this by only recording
estimates if the first attempt to query them produces no results.
Refs #1139
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1471900595-4715-1-git-send-email-duarte@scylladb.com>
After state_processor().process_state() returns proceed::no the upper
layer should have a chance to act before more data is pushed to the
consumer. This means that in case of proceed::no verify_end_state()
should not be called immediately since it may invoke
consume_end_partition().
Fixes#1605.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1471943032-7290-1-git-send-email-pdziepak@scylladb.com>
This patch uses the clustering bounds comparator to correctly detect
wrap around of a clustering range. This fixes a manifestation of #1446,
introduced by b1f9688432, where a query
such as select * from cf where k = 0x00 and c0 = 0x02 and c1 > 0x02
would result in a range containing a clustering key and a prefix,
incorrectly ordered by the prefix equality or lexicographical
comparators.
Refs #1446
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Move scylla-server and scylla-jmx supervisord config files to separate
files and make the main supervisord.conf scan /etc/supervisord.conf.d/
directory. This makes it easier for people to extend the Docker image
and add their own services.
Message-Id: <1471588406-25444-1-git-send-email-penberg@scylladb.com>
As part of the move to unwrap ranges, don't generate wrapping ranges from
thrift. A little extra motivation is to avoid the need for the solution
to #1573 to be able to handle wrapping ranges.
This patch may also be fixing a bug in that the range (token, token] was
previously translated as (-inf, +inf), while now it is translated as
{(token, +inf), (-inf, token]}; the new translation respects ordering
better.
Reviewed-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1471869587-12972-1-git-send-email-avi@scylladb.com>
"This series switches frozen_mutations and to use bytes_ostream
internally so that the size of a single allocation is bounded.
Deserializers are also enhanced so that they can cope with reading
from fragmented buffers.
The goal of the change is to reduce memory pressure in case of
large partitions.
Performance as measured by perf_simple_query (median of 30).
before after diff
read 705270.74 702906.35 -0.3%
write 814504.81 836462.33 +2.7%
Refs #1440.
Refs #1545.
Fixes #1546."
Unlike bytes, bytes_ostream supports fragmented buffers, thus reducing
the pressure on the memory allocator caused by large frozen partitions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Deserialization code is going to use a proxy object that will be casted
to either bytes or bytes_ostream depending on the demand. It cannot be
casted directly to bytes_view though as it won't extend the lifetime of
the buffer appropriately. The simples solution is just to add overloads
that accept const bytes&.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Deserialization code has now two variants. The faster one can be used
only when the source buffer is not fragmented. reduce_chunk_count() aims
to increase number of cases when the fast path can be used.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch makes append() and write() limit the maximum size of a single
allocation to bytes_ostream::max_chunk_size.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
input_stream performs a type erasure on seastar::simple_input_stream and
fragmented_input_stream. The main goal is to keep the overhead for the
cases when simple_input_stream is used minimum.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
fragmented_input_stream is an input stream usable by IDL-generated
deserializers which can read from fragmented buffers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
It is easier for user to figure out the configuration error.
The log looks like:
WARN 2016-08-22 15:04:56,214 [shard 0] gossip - ClusterName mismatch
from 127.0.0.2 test2!=test
WARN 2016-08-22 15:06:16,106 [shard 0] gossip - Partitioner mismatch from 127.0.0.2
org.apache.cassandra.dht.RandomPartitioner!=org.apache.cassandra.dht.Murmur3Partitioner
Fixes: #1587
Message-Id: <745ed8857da6f70745735b94eef7b226d2f22e10.1471849834.git.asias@scylladb.com>
The condition in question is sanity check for a SW bug.
This SW bug (if occurs) is not critical - there is an additional protection
against it in the stop_foreground_and_write().
Having said all that, since we shell not throw from a destructor,
replace throwing of a std::logic_error with an logger error message.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1471773320-7398-1-git-send-email-vladz@cloudius-systems.com>
Glauber "eagle eyes" Costa pointed out that the Scylla logo used in our
Docker image documentation looks broken because it's missing the Scylla
text.
Fix the problem by using the Scylla mascot instead.
Message-Id: <1471525154-2800-1-git-send-email-penberg@scylladb.com>
The bug tracker URL in our Docker image documentation is not clickable
because the URL Markdown extracts automatically is broken.
Fix that and add some more links on how to get help and report issues.
Message-Id: <1471524880-2501-1-git-send-email-penberg@scylladb.com>
allow user to use the `supervisorctl' program to start and stop
services. `exec` needed to be added to the scylla and scylla-jmx starter
scripts - otherwise supervisord loses track of the actual process we
want to manage.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1471442960-110914-1-git-send-email-yoav@scylladb.com>
"This series includes a time stamp representation changes Avi asked.
In addition is fixes a session "duration" semantics to be the time
it took to satisfy the user's request and not a time it took to
achieve the complete replication factor."
* seastar 823a404...81df893 (3):
> memory: Do not increase g_allocs on failure in allocate and allocate_aligned
> memory: Balance the g_frees and g_allocs
> Merge "thread: explicitly yield on get()" from Glauber
Fixes#1586.
Once unlink_leftmost_without_rebalance() has been called on a bi::set no
other method can be used. This includes clear_and_disposed() used by the
mutation_partition destructor.
We like unlink_leftmost_without_rebalance() because it is efficient, so
the solution is to manually finish destroying clustering row and range
tombstone sets in the reader destructor using that function.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
mutation_fragment() constructor allocates memory. If it fails the
already unlinked parts of mutation (either rows_entry or range_tombtone)
will be leaked.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The WorkingDirectory directive does not support environment variables on
systemd version that is shipped with Ubuntu 16.04. Fortunately, not
setting WorkingDirectory implicitly sets it to user home directory,
which is the same thing (i.e. /var/lib/scylla).
Fixes#1319
Signed-of-by: Benoit Canet <benoit@scylladb.com>
Message-Id: <1470053876-1019-1-git-send-email-benoit@scylladb.com>
We have two counters that tracks how many memtable flushes are in progress, and
how much memory are they pinning.
The problem is, after we have revamped the code to limit the amount of flushes
in progress, those counters became useless: as they live inside the semaphore
side, they will only be incremented once we have past the semaphore.
One wouldn't notice if working with CPU-bound problems, where memtables don't
pile. But as soon as they do, those counters will always show the same numbers:
the depth of the semaphore, which doesn't mean much. The problem is poised to
become much worse: once we enable write behind in full and set the semaphore's
depth to one, that's the number we'll see here all the time.
The fix is to move the counters outside the semaphore, which will bring back its
old semantics.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>
A session's "duration" should be a time it took to
handle a request, which is a time till response to a user.
In other words - till a consistency level is reached.
Before this patch is was a time that takes a complete
handling of a request, which is the time it takes to handle
all replicas and not only those required to reach a CL.
This patch fixes this situation by extending the trace_state's state
values to 3 states: inactive, foreground and background.
A primary session may be in 3 states:
- "inactive": between the creation and a begin() call.
- "foreground": after a begin() call and before a
stop_foreground_and_write() call.
- "background": after a stop_foreground_and_write() call and till the
state object is destroyed.
- Traces are not allowed while state is in an "inactive" state.
- The time the primary session was in a "foreground" state is the time
reported as a session's "duration".
- Traces that have arrived during the "background" state will be recorded
as usual but their "elapsed" time will be greater or equal to the
session's "duration".
Secondary sessions may only be in an "inactive" or in a "foreground"
states.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Define an tracing::elapsed_clock type (std::chrono::steady_clock).
Use it instead of trace_state::clock_type.
- Store the "elapsed" information in a form of elapsed_clock::duration.
- Make all keyspace_backend specific conversions inside the trace_keyspace_helper
class, where they belong.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
events_records are promised to be kept alive till the future returned
by apply_events_mutation() resolves: it's dowithificated by a caller already.
In addition, since its passed by a reference, it's a logical thing to demand
it to be kept alive by a caller till the future above resolves.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
"This series changes the tracing back pressure scheme from limiting the amount traces in
a single session by a fixed number to have a per-shard budget consumed by all active tracing
sessions.
It was really easy to cause the traces to be dropped even if there weren't too many
active traces: e.g. if there was a single active session which creates more traces
than a per-session limit (30) the traces above 30-th were going to be dropped. Namely
traces were dropped when there were only 30 active traces, which is ridiculous.
This series introduces two main changes:
- Changes the records budgeting from being per-session to be per-shard. This substantially
increases the amount of active records after which new records are going to be dropped.
- Introduces a flow when events' records are written BEFORE the corresponding tracing
session is over (right now traces are written to I/O back end only when the session object
is destroyed).
The later is meant to virtually eliminate the traces drops in normal situations at all.
Of course, if a back end is slow or if there are a lot of small sessions that do not complete we would still have
to drop new sessions/records in order to avoid uncontrolled growth of a memory foot print of Tracing.
If we see the later case happening a lot in the future we may add lowres timers to each session that would
commit the cached records for writing every X time. But let's not try to optimize something that we
are not completely sure has to be optimized... "
The histogram implementation uses sampling to estimate the mean and sum.
This patch adds a method that returns an estimated sum based on the mean
and the total number of events measured.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1467547341-30438-2-git-send-email-amnon@scylladb.com>
Currently, query_processor.cc code formatting is all over the place,
which makes the file hard to read. Apply some formatting magic to make
it prettier.
Message-Id: <1470832486-26020-2-git-send-email-penberg@scylladb.com>
"Ranges that wrap around are a source of complexity and bugs. This patchset
adds a nonwrapping_range class, which specifies the range can't wrap around.
It is the user of the nonwrapping_range that is required to enforce this
constraint.
The idea is to incrementaly disallow ranges that wrap around. We do it
for query::clustering_range in this patchset, and it can be done similarly
for other ranges. This moves the burden of unwrapping ranges to the edges.
Fixes#1544"
This patch changes the type of query::clustering_range to express that
ranges that wrap around are not allowed, and ranges that have the
start bound after the end bound are considered empty.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes the storage_proxy return an empty result when the
query doesn't define any clustering ranges (default or specific).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes make_clustering_range not enforce that the range be
non-wrapping, so that it can be validated differently if needed. A
make_clustering_range_and_validate function is introduced that keeps
the old behavior.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the nonwrapping_range class. This class is
intended to be used by code that requires non wrapping ranges.
Internally, it uses a wrapping_range. Users are responsible for
ensuring the bounds are correct when creating a nonwrapping_range.
The path proposed here is to incrementally replace usages of
wrapping_range/range by nonwrapping_range, pushing usages of wrapping
ranges as further to the edges as possible.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch renames range to wrapping_range in preparation for adding a
new range type, nonwrapping_range.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Currently, partition snapshot destructor can throw which is a big no-no.
The solution is to ignore the exception and leave versions unmerged and
hope that subsequent reads will succeed at merging.
However, another problem is that the merge doesn't use allocating
sections which means that memory won't be reclaimed to satisfy its
needs. If the cache is full this may result in partition versions not
being merged for a very long time.
This patch introduces partition_snapshot::merge_partition_versions()
which contains all the version merging logic that was previously present
in the snapshot destructor. This function may throw so that it can be
used with allocating sections.
The actual merging and handling of potential erros is done from
partition_snapshot_reader destructor. It tries to merge versions under
the allocating section. Only if that fails it gives up and leaves them
unmerged.
Fixes#1578Fixes#1579.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1471265544-23579-1-git-send-email-pdziepak@scylladb.com>
$ tools/scyllatop/scyllatop.py '*gossip*'
node-1/gossip-0/gauge-heart_beat_version 1.0
node-2/gossip-0/gauge-heart_beat_version 1.0
node-3/gossip-0/gauge-heart_beat_version 1.0
Gossip heart beat version changes every second. If everyting is working
correctly, the gauge-heart_beat_version output should be 1.0. If not,
the gauge-heart_beat_version output should be less than 1.0.
Message-Id: <cbdaa1397cdbcd0dc6a67987f8af8038fd9b2d08.1470712861.git.asias@scylladb.com>
[v2: fix check for static column (don't check if the schema is not compound)
and move want-static-columns flag inside the filtering context to avoid
changing all the callers.]
When a CQL request asks to read only a range of clustering keys inside
a partition, we actually need to read not just these clustering rows, but
also the static columns and add them to the response (as explained by Tomek
in issue #1568).
With the current code, that CQL request is translated into an
sstable::read_row() with a clustering-key filter. But this currently
only reads the requested clustering keys - NOT the static columns.
We don't want sstable::read_row() to unconditionally read the from disk
the static columns because if, for example, they are already cached, we
might not want to read them from disk. We don't have such partial-partition
cache yet, but we are likely to have one in the future.
This patch adds in the clustering key filter object a flag of whether we
need to read the static columns (actually, it's function, returning this
flag per partition, to match the API for the clustering-key filtering).
When sstable::read_row() sees the flag for this partition is true, it also
request to read the static columns.
Currently, the code always passes "true" for this flag - because we don't
have the logic to cache partially-read partitions.
The current find_disk_ranges() code does not yet support returning a non-
contiguous byte range, so this patch, if it notices that this partition
really has static columns in addition to the range it needs to read,
falls back to reading the entire partition. This is a correct solution
(and fixes#1568) but not the most efficient solution. Because static
columns are relatively rare, let's start with this solution (correct
by less efficient when there are static columns) and providing the non-
contiguous reading support is left as a FIXME.
Fixes#1568
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1471124536-19471-1-git-send-email-nyh@scylladb.com>
When the housekeeping configuration name was changed from conf to cfg it
was no longer included as part of the conf rpm.
This change adds a macro that determines of if the file should be
included or not and use that marco to conditionally add the
configuration file to the rpm.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1471169042-19099-1-git-send-email-amnon@scylladb.com>
Files with a conf extension are run by the scylla_prepare on the AMI.
The scylla-housekeeping configuration file is not a bash script and
should not be run.
This patch changes its extension to cfg which is more python like.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470896759-22651-2-git-send-email-amnon@scylladb.com>
maybe_flush_pi_block, which is called for each cell, assumes that
block_first_colname will be empty when the first cell is encountered
for each partition.
This didn't hold after writing partition which generated no index
entry, because block_first_colname was cleared only when there way any
data written into the promoted index. Fix by always clearing the name.
The effect was that the promoted index entry for the next partition
would be flushed sooner than necessary (still counting since the start
of the previous partition) and with offset pointing to the start of
the current partition. This will cause parsing error when such sstable
is read through promoted index entry because the offset is assumed to
point to a cell not to partition start.
Fixes#1567
Message-Id: <1470909915-4400-1-git-send-email-tgrabiec@scylladb.com>
The dist flag mark the debian package as distributed package.
As such the housekeeping configuration file will be included in the
package and will not need to be created by the scylla_setup.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470907208-502-2-git-send-email-amnon@scylladb.com>
Add '--smp', '--memory', and '--overprovisioned' options to the Docker
image. The options are written to /etc/scylla.d/docker.conf file, which
is picked up by the Scylla startup scripts.
You can now, for example, restrict your Docker container to 1 CPU and 1
GB of memory with:
$ docker run --name some-scylla penberg/scylla --smp 1 --memory 1G --overprovisioned 1
Needed by folks who want to run Scylla on Docker in production.
Cc: Sasha Levin <alexander.levin@verizon.com>
Message-Id: <1470680445-25731-1-git-send-email-penberg@scylladb.com>
Scylla was tracking min and max column names instead. Min and max
clustering components are tracked to optimize reads that use a
clustering filter. For more details:
https://issues.apache.org/jira/browse/CASSANDRA-5514
Also fix potential bug if clustering value is empty.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently write events are issued every time a trace session is closed.
However if a single session creates a lot of events we will start dropping them
after the total amount of pending records bypasses the limit.
This patch will issue a write event before the session end in that case.
Since now new events may be added to the active tracing session while it's
scheduled for write we have to ensure the following:
- Not to add the already pending for write session to the pending bulk.
- Grab all pending data in a specific session in a synchronous way during
the write event.
- Serialize creation of events mutations - otherwise the "monotonic nanos"
logic won't work.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Use a per-shard tracing records budget instead
of maintaining a fixed-size per-session records budget and
a per-shard sessions budget.
The original policy could lead to some irrational situations,
when we have a single tracing session that creates a substantial
amount of records that we can handle but we would start dropping
new records after it surpasses the per-session limit.
The new policy handles a per-shard trace records budget that is
being consumed by each trace() call and by a primary session destructor
when a session record is created.
Each active record may only be in one of the following states:
- cached: stored in its session's object. When record is in this state
it's not going to be written to I/O during the next write event.
- pending for write: when record is in this state it's going to be written
to I/O during the next write event.
- flushing: the record is being currently written to the I/O.
There are counters of the total amount of records in each state above.
Each record may only be in a specific state at every point of time and
thereby it must be accounted only in one and only one of the three
counters.
The sum of all three counters should not be greater than
(max_pending_trace_records + write_event_records_threshold) at any time
(actually it can get as high as a value above plus (max_pending_sessions)
if all sessions are primary but we won't take this into an account for
simplicity).
The same is about the number of outstanding sessions: it may not be greater
than (max_pending_sessions + write_event_sessions_threshold) at any time.
If total number of tracing records is greater or equal to the limit
above, the new trace point is going to be dropped.
If current number or records plus the expected number of trace records
per session (exp_trace_events_per_session) is greater than the limit
above new sessions will be dropped. A new session will also be dropped if
there are too many active sessions.
When the record or a session is dropped the appropriate statistics
counters are updated and there is a rate-limited warning message printed
to the log.
Every time a number of records pending for write is greater or equal to
(write_event_records_threshold) or a number of sessions pending for
write is greater or equal to (write_event_sessions_threshold) a write
event is issued.
Every 2 seconds a timer would write all pending for write records
available so far.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When building events' mutation don't apply them in a tight loop
but rather apply each of them in a separate continuation to allow
reactor to interrupt this loop if it takes too long for it to
complete (e.g. where there are a lot of mutations to apply).
Since building all events' mutations is asynchronous now we can
no longer keep the "nanos" state in a global trace_keyspace_helper
object but rather have to move it into the per-session
backend_session_state class.
backend_session_state class is a backend-specific implementation of a
tracing::backend_session_state_base class.
An instance of the above object is created by a
tracing::i_tracing_backend_helper::allocate_session_state() virtual
method and is stored in a tracing::one_session_records object.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Before this patch the interaction between the layers above was as follows:
- trace_state was passing the trace event data to a backend object every
time trace() method was called.
- trace_state was passing the session data to a backend object in a destructor.
- A backend object was storing this data in a form of lambda where all data
above was caught in a capture list. This was primarily done in order to
delay the call for make_xxx_mutation(). Lambdas were stored in a map by a session
ID and they were executed when a kick() method was called.
- A tracing::tracing object was periodically calling a kick() method of a
backend that was initiating a write of all pending data to the storage.
All backend methods used in the described above interactions were virtual.
Thereby, for instance, for each and every trace record we were calling a virtual method that was
receiving a significant amount of parameters, store a lambda in a map and return.
This is clearly a suboptimal way of using virtual functions since we prevent a compiler
from inlining an obviously inlinable operations.
This patch changes the interaction scheme to be as follows:
- Trace events and session data are stored and passed around in a form of structs
that hold all relevant information (no more lambdas).
- As long as a trace session is active its data is aggregated inside the corresponding
trace_state object.
- The object containing all records is passed and stored as a lw_shared_ptr to save extra
copies and to shorten capture lists.
- All aggregated data is passed to a tracing::tracing object in a trace_state destructor.
The data is stored in a std::deque in a tracing::tracing object (instead of a map by a session ID).
- A single backend's virtual method call writes all data aggregated so far (kick()
method is not needed any more), every time a write event occurs.
- Backend has only one virtual method now:
- Write a bulk of sessions' data aggregated so far.
- Backend's virtual method receives a records bulk object by reference.
As a result:
- A latency of a single trace event that has no formatting improved from 0.2us to 0.1us.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
The sanitizer of the debug build warns when a "bool" variable is read when
containing a value not 0 or 1. In particular, if a class has an
uninitialized bool field, which class logic allows to only be set later,
then "move"ing such an object will read the uninitialized value and produce
this warning.
This patch fixes four of these warnings seen in sstable_test by initializing
some bool fields to false, even though the code doesn't strictly need this
initialization.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com>
This patch sets the default validator for dynamic column families.
Doing so has no consequences in terms of behavior, but it causes the
correct type to be shown when describing the column family through
cassandra-cli.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470739773-30497-1-git-send-email-duarte@scylladb.com>
"The series adds an optional configuration file to the scylla-housekeeping. The
file act as a way to prevent the scylla-housekeeping to run. A missing
configuration file, will make the scylla-housekeeping immediately.
The series adds a flag to the build_rpm that differentiate between public
distributions that would contain the configuration file and private
distributions that will not contain it which will cause the setup script to
create it."
The promoted-index reading code contained a bug where it copied the value
of an disengaged optional (this non-value was never used, but it was still
copied ). Fix it by keeping the optional<> as such longer.
This bug caused tests/sstable_test in the debug build to crash (the release
build somehow worked).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470742418-8813-1-git-send-email-nyh@scylladb.com>
This patch ensures we always send the column metadata, even when the
column family is dynamic and the metadata is empty, as some clients
like cassandra-cli always assume its presence.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470740971-31169-1-git-send-email-duarte@scylladb.com>
Commit 0d8463aba5 broke some of the tests with an assertion
failure about local_is_initialized(). It turns out that there is more than
one level of local_is_initialized() we need to check... For some tests,
neither locals were initialized, but for others, one was and the other
wasn't, and the wrong one was tested.
With this patch, all unit tests except "flush_queue_test.cc" pass on my
machine. I doubt this test is relevant to the promoted index patches,
but I'll continue to investigate it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com>
"
While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.
Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.
Flush queue can now (optionally) propagate exceptions to all clients, and
commit log uses this to ensure that commit log writers in batch mode
all generate exceptions on disk errors.
Also includes some rudimentary tests for flush queue mechanisms.
Note: other main user, sstable flushing, is not affected, as default
mode is still to keep exceptions to individual worker continuations,
not waiters."
"The goal of this patch series is to support reading and writing of a
"promoted index" - the Cassandra 2.* SSTable feature which allows reading
only a part of the partition without needing to read an entire partition
when it is very long. To make a long story short, a "promoted index" is
a sample of each partition's column names, written to the SSTable Index
file with that partition's entry. See a longer explanation of the index
file format, and the promoted index, here:
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
There are two main features in this series - first enabling reading of
parts of partitions (using the promoted index stored in an sstable),
and then enable writing promoted indexes to new sstables. These two
features are broken up into smaller stand-alone pieces to facilitate the
review.
Three features are still missing from this series and are planned to be
developed later:
1. When we fail to parse a partition's promoted index, we silently fall back
to reading the entire partition. We should log (with rate limiting) and
count these errors, to help in debugging sstable problems.
2. The current code only uses the promoted index when looking for a single
contiguous clustering-key range. If the ck range is non-contiguous, we
fall back to reading the entire partition. We should use the promoted
index in that case too.
3. The current code only uses the promoted index when reading a single
partition, via sstable::read_row(). When scanning through all or a
range of partitions (read_rows() or read_range_rows()), we do not yet
use the promoted index; We read contiguously from data file (we do not
even read from the index file, so unsurprisingly we can't use it)."
In this unit test, we create using Scylla C++ code, the same large
partition with 13520 CQL rows as we previously imported from Cassandra
for the large partition test. We then verify that the sstable index file
we just wrote is byte-for-byte identical to the one previously created by
Cassandra. They should indeed be identical, because the data file has the
same layout (even if timestamps are different) and our default promoted-
index block size is the same (64K) so the sample of columns should be
identical.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds writing of promoted index to sstables.
The promoted index is basically a sample of columns and their positions
for large partitions: The promoted index appears in the sstable's index
file for partitions which are larger than 64 KB, and divides the partition
to 64 KB blocks (as in Cassandra, this interval is configurable through
the column_index_size_in_kb config parameter). Beyond modifying the index
file, having a promoted index may also modify the data file: Since each
of blocks may be read independently, we need to add in the beginning of
each block the list of range tombstones that are still open at that
position.
See also https://github.com/scylladb/scylla/wiki/SSTables-Index-FileFixes#959
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add to the range_tombstone_accumulator a range_tombstones_for_row(ck)
method.
Just like the existing tombstone_for_row(ck), this function drops from
the accumulator tombstones that end before ck. But while the existing
function returned just a single tombstone affecting the given row (the
most recent tombstone), the new function range_tombstones_for_row(ck)
returns all the accumulated range tombstones which cover ck.
This function will be useful for the promoted-index writing code later,
which divides a partition into blocks which may be read independently,
so each block needs to start with a repeat of the earlier tombstones
which still cover the first row in the new block.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds support more efficiently reading small parts of a large
partition, without reading the entire partition as we had to do so far.
This is done using the "promoted index".
The "promoted index" is stored in the sstable index file, and provides
for each large sstable row ("partition" in CQL nomenclature) a sample of
the column names at (for example) 64KB intervals. This means that when we
read a slice of columns (e.g., cql rows), or page through a large partition,
we do not have to read the entire partition from disk.
This patch only implements the read side of promoted index - a later patch
will add the write-side support (i.e., writing the promoted index to the
index file while saving the sstable). Nevertheless this patch can already
be tested by reading existing sstables from Cassandra which include a
promoted index - such as the one included in the test in the previous patch.
The use of the promoted index currently has two limitations:
1. It is only used when reading a single partition with sstable::read_row(),
not when scanning through many partitions with sstable::read_range_rows()
or sstable::read_rows().
2. It is only used when filtering a single clustering-key range, rather
than a list of disjoint ranges. A single range is the common case.
These two issues will be improved later. In the meantime, in those
unsupported cases we simply continue to read entire partitions, so we're not
worse-off than before.
Also note that this patch only helps when sstable::read_row() is used with
a clustering-key prefix (i.e., a slice). Our higher-level request handling
code may decide to read an entire partition into the cache, and not use
a clustering-key prefix at all when reading. We will need to indepdently
improve the high-level code to use read_row()'s slicing capabilities
when paging through large partitions, for example.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our sstable reading code is currently hard-coded to read entire partitions,
even if we know that only a subset of the columns are requested.
This patch introduces find_disk_ranges(), a function to find the ranges of
bytes we need to read from the sstable data file to guarantee that the
desired columns from the desired partition are read.
The returned range may be the entire byte range of the given partition -
as found using the summary and index files - but if the index contains a
"promoted index" (basically a sample of column positions for each key)
we may return a smaller range. The "disk_read_range" type introduced in
the previous patch is extended here to support reading a partial partition -
by including additional information which would be missed when reading only
part of a partition (viz., the partition key and the partition's tombstone).
This function isn't used in this patch - we will wire its use in the next
patch, which will complete the read-side support for the promoted index.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our index_entry type, holding one partition's entry that we read from the
index file, already contained the "_promoted_index" which we read from
disk - as an unparsed byte buffer. But there wasn't any API to access
this buffer after it was read. This patch adds a trivial getter, to get
a read-only view of this buffer.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "struct column" code in partition.cc is generally useful code for
parsing serialized column names from the sstable. It is currently private
inside the "mp_row_consumer" class. But in a next patch we'll also want
to use it in the "sstable" class, for the promoted-index parsing code,
which among other things also needs to deserialize column names.
The trivial fix, in this patch, is to make this code "public". However,
for now it is still available only in partition.cc.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, the main sstable data parsing entry point data_consume_rows()
takes a contiguous range of bytes to read from disk and parse. This range
is supposed to be an entire partition or contiguous group of partitions.
and is self contained (can be parsed without extra information about the
identity of these partitions).
For the promoted index feature (which we will add in a following patch)
we will want the range to span only a part of a partition, and will need
the caller to provide some information not available to the parser (such
as the partition's key). In the future, we will also want to support a
vector of byte ranges, instead of just one.
So in preparation for this, this patch simply replaces the start/end pair
by a new class disk_read_range, which can be easily extended in later
patches. No new functionality is introduced in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In a later patch adding "promoted index" read support, we would like to
parse only part of an sstable row. In that case, the parser should start
not at the usual ROW_START state, but rather at the ATOM_START state.
But there's a problem: The sstable parser consumer currently assumes that
the parser stops after the start of the row, before reading any atoms.
So in the partial row case too, we must stop parsing before reading the
first atom.
For this, this patch adds the new "STOP_THEN_ATOM_START" parser state.
When starting in this state, the parser stops immediately (with
row_consumer::proceed::no), and when restarted again it will be in the
ATOM_START case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test that takes an sstable with one partition of 13,520
clustering rows (spanning 700 KB in the data file), and attempts to read
various slices CQL rows, counting that we got back the expected number
of rows.
The sstable included here was generated by Cassandra, and includes a
promoted index. Promoted index reading is not supported yet (we will
add it in the next patch), so for now the code will always read the
entire partition from disk; But still the clustering-key filtering is
already functional, and will drop some of the rows as requested,
so this test will pass.
Later, when we add promoted index support, we should check that this test
still passes - promoted index will make the reads in this test more
efficient (which the test cannot verify), but the important thing to check
is that it doesn't break any of these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Adding the configuration file is a way to make the running
scylla-housekeeping service not run the check version.
If the file does not exists, it will be set by the setup script.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The new dist flag difrentiate between public distribution and private
compilations.
For public distributation the housekeeping configuration file will be
installed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When running the scylla_setup, the script would check that the
housekeeping configuration file: housekepping.conf exists.
If not, it would ask the user if to run the check version option or not
and create the file accordingly.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The housekeeping.conf is a configuration file for scylla-housekeeping.
By default it will be included in the rpm and state that the
check-version would be run.
If the file is missing, or if check-version is set to false, the check
version operation will not be performed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds an optional config file that can be passed from the
command line.
If the config file is specified and does not exist, the script will
terminate.
The only parameter that is currently available is check-version that can
be either true or false.
The ConfigParser module is used to read the config file, that should be
on the form of:
[housekeeping]
check-version: True
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"The compact column is a dense schema's single regular column. Its
existence has been a source of bugs, so this patchset removes the
column_kind::compact_column, as well as further references to compact
columns from the code base.
Fixes#1542"
"Kubernetes is unhappy with our Docker image because we start systemd
under the hood. Fix that by switching to use "supervisord" to manage the
two processes -- "scylla" and "scylla-jmx":
http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/
While at it, fix up "docker logs" and "docker exec cqlsh" to work
out-of-the-box, and update our documentation to match what we have.
Further work is needed to ensure Scylla production configuration works
as expected and is documented accordingly."
Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().
This patch wraps the relevant calls in the reclaimer lock. This do mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.
To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function
Thanks Tomek for the suggestion.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
Fix Docker Hub documentation to match what we have right now.
More work is needed in the following areas:
* How to make a cluster
* How to configure Docker image for production use
We configure the hostname in the "CQLSH_HOST" environment variable but
that is only picked up if we first start the shell. Setup the hostname
in $HOME/.cqlshrc file instead so that we can start "cqlsh" directly:
docker exec -it scylla cqlsh
We don't have systemd running on the image so "journalctl" is useless.
Log to stdout instead which has the nice benefit of making "docker logs"
produce meaningful output on the host.
Issue 1510 describes a scenario in which, under load, we allocate memory within
release_requests() leading to a reentry into an invalid state in our
blocked requests' shared_promise.
This is not easy to trigger since not all allocations will actually get to the
point in which they need a new segment, let alone have that happening during
another allocator call.
Having those kinds of reentry is something we have always sought to avoid with
release_requests(): this is the reason why most of the actual routine is
deferred after a call to later().
However, that is a trick we cannot use for updating the state of the blocked
requests' shared_promise: we can't guarantee when is that going to run, and we
always need a valid shared_promise, in a valid state, waiting for new requests
to hook into.
The solution employed by this patch is to make sure that no allocation
operations whatsoever happen during the initial part of release_requests on
behalf of the shared promise. Allocation is now deferred to first use, which
relieves release_requests() from all allocation duties. All it needs to do is
free the old object and signal to the its user that an allocation is needed (by
storing {} into the shared_promise).
Fixes#1510
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>
This is a confusing one, and can be replaced the fact that dense
schemas have a single regular column.
Ref #1542
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A compact column is a dense schema's single regular column. The fact
that it is a different column_kind has lead to various bugs (#1535,
derived by the schema being dense and the column being regular.
Fixes#1542
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Fixes: #1490
While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.
Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.
Thus, adding a muation to commit log in batch, will now, if an error
is generated, result in an exception to the caller, which should be
interpreted as "data might not have been persisted".
The failing segment is then closed, and we happily hope things will
get better in the next. Which they probably wont.
Missing: registration of some sort of "error-handling policy", similar
to origin, which can either kill transports or shut down process.
(A reasonable guess is that disk errors in commit log are not gonna
be recoverable).
Re-worked to use shared_promise<> as signal mechanism
(because we have that now), which also makes it less painful
to implement exceptions propagating not only from "func" to
"post", but also from given func->post chain entry to any
waiters.
v2:
* Remove leading "_" in template types
Useful for triggerring core dump on allocation failure inside LSA,
which makes it easier to debug allocation failures. They normally
don't cause aborts, just fail the current operation, which makes it
hard to figure out what was the cause of allocation failure.
Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>
This patch adds the prometheus API it adds the proto library to the
compilation, adds an optional configuration parameter to change the
prometheus listening port and start the prometheus API in main.
To disable the prometheus API, set its listening port to 0.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470231628-22831-2-git-send-email-amnon@scylladb.com>
This patch adds the prometheus API it adds the proto library to the
compilation, adds an optional configuration parameter to change the
prometheus listening port and start the prometheus API in main.
To disable the prometheus API, set its listening port to 0.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470228764-19545-2-git-send-email-amnon@scylladb.com>
data_consume_rows_context needs to have close() called and the returned
future waited for before it can be destroyed. data_consume_context::impl
does that in the background upon its destruction.
However, it is possible that the sstable is removed before
data_consume_rows_context::close() completes in which case EBADF may
happen. The solution is to make data_consume_context::impl keep a
reference to the sstable and extend its life time until closing of
data_consume_rows_context (which is performed in the background)
completes.
Side effect of this change is also that data_consume_context no longer
requires its user to make sure that the sstable exists as long as it is
in use since it owns its own reference to it.
Fixes#1537.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1470222225-19948-1-git-send-email-pdziepak@scylladb.com>
Switch to supervisord to manage the two processes we have: Scylla server
and Scylla JMX proxy. We need this to make the Docker image run under
Kubernetes, which now fails as follows as we try to start the systemd
init process:
Couldn't find an alternative telinit implementation to spawn.
I have not seen other people hitting the issue, except for GitLab Docker
image:
https://gitlab.com/gitlab-org/gitlab-ce/issues/18612
which "solved" the problem by not running init...
https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/838/diffs
Furthermore, the "supervisord" approach seems to be what people actually
use in Docker land:
http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/
The only downside is that we now sort of duplicate functionality that's
already in the systemd configuration files. However, we should work
towards Scylla figuring out its configuration rather than compose a long
list of command line arguments. Once we do that, the duplication in
Docker supervisord scripts disappears.
"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 works correctly."
With this patch we stop counting dead partitions (i.e., partitions
containing only tombstones) towards the partition limit, which
should apply only to partitions with live rows.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
With this patch we stop counting dead partitions (i.e., partitions
containing only tombstones) towards the partition limit, which should
apply only to partitions with live rows.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This reverts commit 141ea49e05.
There was a confusion around the meaning of "partition limit".
Parts of our code interpreted it just as "maximum number of partitions".
This is also how Cassandra behaves.
However, the other parts of the code, including data query, interpreted
it as "maximum number of live partitions" or otherwise skipped dead
partitions resulting in #1447.
A decision has been made to stick to the "maximum number of live
partitions" interpretation everywhere. The consequences are, among
others, that the patch reverted by this one is no longer correct.
While, the actual series fixing the interpretations of partition limit
and getting rid of the confusion is yet to come, the purpose of this
revert is to make backporting easier (as the patch being reverted
hasn't made it to branch-1.3 yet).
This patch ensures that when the schema is dense, regardless of
compact_storage being set, the single regular columns is translated
into a compact column.
This fixes an issue where Thrift dynamic column families are
translated to a dense schema with a regular column, instead of a
compact one.
Since a compact column is also a regular column (e.g., for purposes of
querying), no further changes are required.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470062410-1414-1-git-send-email-duarte@scylladb.com>
This series adds the ability for partition cache to keep information
whether partition size makes it uncacheable. During, reads these
entries save us IO operations since we already know that the partiiton
is too big to be put in the cache.
First part of the patchset makes all mutation_readers allow the
streamed_mutations they produce to outlive them, which is a guarantee
used later by the code handling reading large partitions.
Inherit the alignment parameters from the underlying file instead of
defaulting to 4096. This gives better read performance on disks with 512-byte
sectors.
Fixes#1532.
Message-Id: <1470122188-25548-1-git-send-email-avi@scylladb.com>
"schema_altering_statement is both a cql_statement and a
prepared_statement.
This makes it hard to understand because virtual functions from both
base classes are present, and hard to separate the raw and prepared
variants.
This patchset removes schema_altering_statement::prepare() (and the
enable_shared_from_this<> that makes it work) in preparation for
splitting its subclasses into raw and prepared variants (note that
create_table_statement was already split)."
Fixes errors like the one below:
(gdb) scylla memory
Python Exception <class 'gdb.error'> A syntax error in expression, near `memory::cpu_mem'.:
Error occurred in Python command: A syntax error in expression, near `memory::cpu_mem'.
Wrapping the symbol in quotes instructs GDB to lookup in the global
context instead of the context of current frame.
Message-Id: <1470050751-3167-1-git-send-email-tgrabiec@scylladb.com>
There was no way to setup correct repo when AMI is building by --localrpm option, since AMI does not have access to 'version' file, and we don't passed repo URL to the AMI.
So detect optimal repo path when starting build AMI, passes repo URL to the AMI, setup it correctly.
Note: this changes behavor of build_ami.sh/scylla_install_pkg's --repo option.
It was repository URL, but now become .repo/.list file URL.
This is optimal for the distribution which requires 3rdparty packages to install scylla, like CentOS7.
Existing shell scripts which invoking build_ami.sh are need to change in new way, such as our Jenkins jobs.
Fixes#1414
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1469636377-17828-1-git-send-email-syuu@scylladb.com>
This series handle two issues:
* Moving to python2, though python3 is supported, there are modules that we
need that are not rpm installable, python3 would wait when it will be more
mature.
* Check version should send the current version when it check for a new one and
a simple string compare is wrong.
In preparation for splitting raw and prepared variants of subclasses
of schema_altering_statement, remove schema_altering_statment::prepare().
All subclasses already implement it themselves (by creating a new instance).
In preparation for the removal of schema_altering_statement::prepare(),
add a fake create_table_statement::prepare(). create_table_statement
has already been split to raw and prepared variants, so this prepare()
will never be called, but it is required because schema_altering_statement
is both a cql_statement and a prepared_statement. This confusion will
be fixed later on.
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
To prepare for the separation of prepared and raw schema_altering_statements,
avoid the reliance this class implementing both the raw and prepared
variants and the use of shared_from_this().
Once we encounter a wide partition store information
about this in cache entry and don't try to read it all
and cache next time it's requested.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[Paweł: rebased, moved large partition reading logic to
cache_entry::read_wide()]
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In case of failure BOOST_REQUIRE_EQUAL() is nicer and prints the actual
values that were supposed to be equal, but aren't.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
There were cases of use-after-free introduced by the code responsible
for creating mutation_readers for large partitions – the lifetimes
of partition ranges and the readers themselves weren't sufficiently
extended.
Another problem, was that if the partition was no longer present in the
sstable the reader would return EOS which was then returned by
range_populating_reader itself causing its users to incorrectly
interpret that as an end of stream.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch makes sstable_streamed_mutation keep a reference to
sstable_data_source object which contains full state necessary to read
the sstable. That state is also shared with parent mutation_reader (only
for range queries), but now its lifetime is appropriately extended if
the mutation_reader is destoryed before streamed_mutation.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable has member functions that create objects which need to extend
lifetime of the sstable (for example mutation_readers), the easiest way
to achieve that is to enable_lw_shared_from_this for sstable.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
get_sstables_including_compacted_undeleted() may return temporary shared
ptr which will be destroyed before the loop if not stored locally.
Fixes#1514
Message-Id: <20160728100504.GD2502@scylladb.com>
The current code assumes cell names are always compound and may
wrongly report a non-static row as such, since it looks at the first
bytes of the name assuming they are the component's length.
Tables with compact storage (which cannot contain static rows) may not
have a compound comparator, so we check for the table's compoundness
before checking for the static marker. We do this by delegating to
composite_view::is_static.
Fixes#1495
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-4-git-send-email-duarte@scylladb.com>
composite_view's is_static function is wrong because:
1) It doesn't guard against the composite being a compound;
2) Doesn't deal with widening due to integral promotions and
consequent sign extension.
This patch fixes this by ensuring there's only one correct
implementation of is_static, to avoid code duplication and
enforce test coverage.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-2-git-send-email-duarte@scylladb.com>
This patch handle two issues with check version:
* When checking for a version, the script send the current version
* Instead of string compare it uses parse_version to compare the
versions.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
There is a problem with python module installation in pythno3,
especially on centos. Though pytho34 has a normal package, alot of the
modules are missing yum installation and can only be installed by pip.
This patch switch the scylla-housekeeping implementation to use
python2, we should switch back to python3 when CeontOS python3 will be
more mature.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The test bit rot. One thing is that cache is no longer empty right
after creation, since we added ghost entries to it. Second problem is
that mutation serializer needs storage service to be initialized, so
we need to setup a full cql test env.
Message-Id: <1469626546-4279-1-git-send-email-tgrabiec@scylladb.com>
This patch changes the column_visitor so that it preservers the order
of the partitions it visits when building the accumulation result.
This is required by verbs such as get_range_slice, on top of which
users can implement paging. In such cases, the last key returned by
the query will be that start of the range for the next query. If
that key is not actually the last in the partitioner's order, then
the new request will likely result in duplicate values being sent.
Ref #693
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469568135-19644-1-git-send-email-duarte@scylladb.com>
If partition contains no static and clustering rows or range tombstones
mp_row_consumer will return disengaged mutation_fragment_opt with
is_mutation_end flag set to mark end of this partition.
Current, mutation_reader::impl code incorrectly recognized disengaged
mutation fragment as end of the stream of all mutations. This patch
fixes that by using is_mutation_end flag to determine whether end of
partition or end of stream was reached.
Fixes#1503.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1469525449-15525-1-git-send-email-pdziepak@scylladb.com>
I tried to start scylla-housekeeping service by:
# sudo systemctl restart scylla-housekeeping.service
But it's failed for wrong script path, error detail:
systemd[5605]: Failed at step EXEC spawning
/usr/lib/scylla/scylla-Housekeeping: No such file or directory
The right script name is 'scylla-housekeeping'
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <c11319a3c7d3f22f613f5f6708699be0aa6bd740.1469506477.git.amos@scylladb.com>
Similar to gdb's "thread apply all". Executes given command in the
context of each seastar thread.
For example to print backtrace of all threads:
scylla thread apply-all bt
If a composite is not a compound, then it doesn't carry a length
prefix where static information is encoded. In its absence, a
non-compound composite can never be static.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469397561-7748-1-git-send-email-duarte@scylladb.com>
This patch ensures we fail when creating a mixed column family, either
when adding columns to a dynamic CF through updated_column_family() or
when adding a dynamic column upon insertion.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469378658-19853-1-git-send-email-duarte@scylladb.com>
This patch implements the describe_splits_ex verbs by querying the
size_estimates system table for all the estimates in the specified
token range.
If the keys_per_split argument is bigger then the
estimated partitions count, then we merge ranges until keys_per_split
is met. Note that the tokens can't be split any further,
keys_per_split might be less than the reported number of keys in one
or more ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The query_size_estimates() function queries the size_estimates system
table for a given keyspace and table, filtering out the token ranges
according to the specified tokens.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch fixes stop() by checking if the current CPU instead of
whether the service is active (which it won't be at the time stop() is
called).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes range_estimates a proper struct, where tokens are
represented as dht::tokens rather than dht::ring_position*.
We also pass other arguments to update_ and clear_size_estimates by
copy, since one will already be required.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Cassandra 1.x clusters often use RandomPartitioner. Supporting
RandomPartitioner will allow easier migration to Scylla
Tests are added to make sure scylla generates the same token as
Cassandra does for the same partition key.
Fixes#1438
Message-Id: <3bc8b7f06fad16d59aaaa96e2827198ce74214c6.1469166766.git.asias@scylladb.com>
* seastar 9d1db3f...2425aab (5):
> Merge "Track all threads globally" from Tomasz
> net: provide remote address on native_server_socket_impl<Protocol>::accept()
> Introduce install-dependencies.sh
> reactor: make sure a poll cycle always happens when later is called
> Merge "rpc: various fixes and cleanups" from Gleb
This patch converts an exceptions::invalid_request_exception
into a Thrift InvalidRequestException instead of into a generic one.
This makes TitanDB work correctly, which expects an
InvalidRequestException when setting a non-existent keyspace.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469362086-1013-1-git-send-email-duarte@scylladb.com>
We use ::abs(), which has an int parameter, on long arguments, resulting
in incorrect results.
Switch to std::abs() instead, which has the correct overloads.
Fixes#1494.
Message-Id: <1469347802-28933-1-git-send-email-avi@scylladb.com>
A seastar::value_of() lambda used in a trace point was doing the unthinkable:
it called std::move() on a value captured by reference. Not only it compiled(!!!)
but it also actually std::move()ed the shared_ptr before it was used in a make_result()
which naturally caused a SIGSEG crash.
Fixes#1491
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1469193763-27631-1-git-send-email-vladz@cloudius-systems.com>
unaligned_cast violates strict aliasing, and causes code misgeneration on
gcc 6. Replace it with read_be/write_be, which are nicer anyway.
Message-Id: <1469122850-7511-1-git-send-email-avi@scylladb.com>
Fixes#1484.
We drop tables as part of keyspace drop. Table drop starts with
creating a snapshot on all shards. All shards must use the same
snapshot timestamp which, among other things, is part of the snapshot
name. The timestamp is generated using supplied timestamp generating
function (joinpoint object). The joinpoint object will wait for all
shards to arrive and then generate and return the timestamp.
However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
The fix is to give each table a separate joinpoint instance.
Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>
This patch changes lookup_schema() so it directly calls
database::find_schema() instead of going through
database::find_column_family(). It also drops conversion of the
no_such_column_family exeption, as that is already handled at a higher
layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"When reading a partition try to read it all
but once more bytes are read than a given limit
we decide that partition is wide and we don't cache it.
Instead we retry the read with clustering key filtering
applied."
Previously, the Docker image could only be run interactively, which is
not conducive for running clusters. This patch makes the docker image
run in the background (using systemd). This makes the docker workflow
similar to working with virtual machines, i.e. the user launches a
container, and once it is running they can connect to it with
docker exec -it <container_name> bash
and immediately use `cqlsh` to control it.
In addition, the configuration of scylla is done using established
scripts, such as `scylla_dev_mode_setup`, `scylla_cpuset_setup` and
`scylla_io_setup`, whereas previously code from these scripts was
duplicated into the docker startup file.
To specify seeds for making a cluster, use the --seeds command line
argument, e.g.
docker run -d --privileged scylladb/scylla
docker run -d --privileged scylladb/scylla --seeds 172.17.0.2
other options include --developer-mode, --cpuset, --broadcast-address
The --developer-mode option mode is on by default - so that we don't fail users
who just want to play with this.
The Dockerfile entrypoint script was rewritten as a few Python modules.
The move to Python is meritted because:
* Using `sed` to manipulate YAML is fragile
* Lack of proper command line parsing resulted in introducing ad-hoc environment variables
* Shell scripts don't throw exceptions, and it's easy to forget to check exit codes for every single command
I've made an effort to make the entrypoint `go' script very simple and readable.
The goary details are hidden inside the other python modules.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1468938693-32168-1-git-send-email-yoav@scylladb.com>
If mutation is bigger than this limit
it won't be read and mutation_from_streamed_mutation
will return empty optional.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
- Don't use inline for templates.
- Put "inline" qualifier for out-of-class defined methods
where they are defined and not where they are declared.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Now date tiered compaction strategy will take into account the
strategy options which are defined in the schema.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This series includes the following:
- Introduction of a formatted message support in trace().
- Major rename: s/flush_/write_/, s/flush()/kick()/, s/store_/write_/.
- Some cosmetic fixes found on the way.
- Fix a bug in a shutdown flow.
- Instrumentation to MUTATE, PREPARE, EXECUTE and BATCH flow and some
related changes.
- A patch that aligns the QUERY tracing format with the Origin.
- Methods and functions description in tracing/trace_state.hh."
Add a proper description to a tracing::trace() that clarifies
that the tracing message string and the positional parameters
are going to be copied if tracing state is initialized.
Add a description for trace_state::begin() methods and for a
tracing::begin() helper function.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Don't put the query name as a 'request' but rather save it as one of entries in a
'params' map.
- Save some additional query parameters in 'params'.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Store the trace state in the abstract_write_response_handler.
Instrument send_mutation RPC to receive an additional
rpc::optional parameter that will contain optional<trace_info>
value.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Adding this method allows to use tracing helper functions
and remove the no longer needed accessors in the query_state.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
From now on trace_state::trace() is able to receive the sprint-ready
format string with the arguments that will be applied only during
the flush event.
This patch also optimizes the way the source address is evaluated -
do it only once instead of twice if tracing is requested.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Sometimes we want to be able to set "params" map after we
started a tracing session, e.g. when the parameters values,
like a consistency level parsed from the "options" part of a binary frame,
are available only after some heavy part of a flow we would like
to trace.
This patch includes the following changes:
- No longer pass a map to the begin().
- Limit the parameters to the known set.
- Define a method to set each such parameter and save its
value till the final sstring->sstring map is created.
- Construct the final sstring->sstring map in the destructor of the trace_state
object in order to defer all the formatting to be after the traced flow.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
A backend helper has to constantly communicate with the corresponding
tracing::tracing instance. By saving a reference to the tracing::tracing instance
will save us a lot of tracing::get_local_tracing_instance() calls and thus
a lot of dereferencing.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Extend the i_tracing_backend_helper interface to accept the event
record timestamp.
- Grab the current timestamp when the event record is taken.
- Add the instrumentation to the trace_keyspace_helper to create a unique time-UUID
from a given std::chrono::duration object.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This helper returns an std::experimental::optional<trace_info>
which is initialized or not initialized depending on whether
a given trace_state_ptr is initialized or not.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Add an support for passing a format string plus positional parameters
for creation of a trace point message.
Format string should be given in a fmt library native format described
here: http://fmtlib.net/latest/syntax.html#syntax .
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
kick() backend during shutdown and restrict accessing a backend
after that.
Flush pending records when service is being shut down.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Prevent the destruction of tracing::tracing instances while there
are still tracing::trace_state objects that are using it:
- Make tracing::tracing inherit from seastar::async_sharded_service<tracing::tracing>.
- Grab a tracing::tracing.shared_from_this() in each
tracing::trace_state object using it.
- Use a saved pointer to the local tracing::tracing instance in a destructor
instead of accessing it via tracing::get_local_tracing_instance()
to avoid "local is not initialized" assert when sessions are
being destroyed after the service was stopped.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In names of functions and variables:
s/flush_/write_/
s/store_/write_/
In a i_tracing_backend_helper:
s/flush()/kick()/
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Promoted index may cause sstable to have range tombstones duplicated
several times. These duplicates appear in the "wrong" place since they
are smaller than the entity preceeding them.
This patch ignores such duplicates by skipping range tombstones that are
smaller than previously read ones.
Moreover, these duplicted range tombstone may appear in the middle of
clustering row, so the sstable reader has also gained the ability to
merge parts of the row in such cases.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This patchset implements the size_estimates_recorder, which periodically
writes estimations for all the non-system column families in the
size_estimates system table. This table is updated per schema with a set
of token ranges and the associated estimations of how many partitions
there are and their mean size.
Fixes#352"
This patch implements the size_estimates_recorder, which periodically
writes estimations for all the non-system column families in the
size_estimates system table. The size_estimates_recorder class
corresponds to the one in Cassandra's SizeEstimatesRecorder.java.
Estimation is carried out by shard 0. Since we're estimating based on
data in shared sstables, having multiple shards doing this would skew
the results.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements functions that allow the size_estimates system
table to be updated and cleared. The size_estimates table is updated
per schema with a set of token ranges and the associated estimations
of how many partitions there are and their mean size.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds an utility function that allows fetching the set of
column_families that do not belong to the system keyspace.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch allows a set of a column_family's sstables to be
selected according to a range of ring_positions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes it so that the template arguments of
range<T>::transform are more easily deducible by the compiler.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
That was a bug in the test itself. It could happen that a sstable would
incorrectly belong to the next time window if the current minute is
approaching its end. Fix is about having all sstables that we want in
the same time window with the same min/max timestamp.
Fixes#1448.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ee25d49e7ed12b4cf7d018a08163404c3d122e56.1468782787.git.raphaelsc@scylladb.com>
This patch avoids moving entries from range tombstones and clustering
rows sets in streamed_mutation_from_mutation(). Such action breaks these
sets as the entries will be left in some unknown state.
Instead, the sets are being broken in a supported and predictable way
using unlink_leftmost_without_rebalance().
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1468843205-18852-1-git-send-email-pdziepak@scylladb.com>
* seastar d699205...a45823a (5):
> rpc: do not call shutdown function on already closed fd
> log: Do not crash if logger is invoked from non-reactor thread
> rpc: remove unaligned_cast and reinterpret_cast uses
> unaligned: note unaligned_cast<> is deprecated
> byteorder: add unaligned read/write helpers
Fixes#1463.
This patch adds authorization to the DDL thrift verbs. Since checking
for authorization is asynchronous, we now need to copy the verb
arguments so they can be accessed from the continuations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This function is similar to has_column_family_access, but skips
validating if the specified keyspace and column family names map to a
valid schema, as it already takes one as an argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch transforms the mutation map, a map of keys to a map of columns
families to mutations, into a map of column families to a map of keys
to mutations. This makes is a more natural organization, as things
like checking access permissions are done by column family.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This is a wrapper around with_cob, which fetches a schema and forwards
it to a supplied function.
The patch also removes superfluous return instructions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch validates that a user is correctly logged in (if
authentication is required) for the required verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The scylla-server.service will try to start the scylla-housekeeping.
This patch adds a question to the scylla_setup if to enable the version
check, if the answer is no, the scylla-housekeeping will be masked.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1468741129-1977-1-git-send-email-amnon@scylladb.com>
Move the to_bytes_view(temporary_buffer<char>) function from source file
to header file where is can be used in more places.
This saves one use of reinterpret_cast (which we are no re-evaluating),
and moreover, we want to use this function also in the promoted index
code (to return a bytes_view from the promoted index which was saved as a
temporary_buffer).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>
* seastar 5e97d5f...d699205 (3):
> rpc: fix race between send loop and expiration timer
> rpc: fix cancellable type move operations
> reactor: create new files with a more reasonable default mode
Range queries need to take special care when transitioning between
ranges that are read from sstables and ranges that are already in the
cache.
Original code in such case just started a secondary reader and told it
to unconditionally mark the last entry as continuous (primary reader has
already returned an element tha immediately follows the range that is
going to be read form sstables).
However, that information may get stale. For instance, by the time
secondary reader finish reading its range the element immediately
following it may get evicted from the cache thus causing continuity flag
to be incorrectly set.
The solution is to ensure that the element immediately after the range
read from sstables is still in the cache.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1468586893-15266-1-git-send-email-pdziepak@scylladb.com>
From Duarte:
This patchset adds support for the data manipulation verbs. It defers support
for super columns and mixed CFs (a static CF treated as dynamic) to later
patchsets.
Everything is done on top of storage_proxy; it was only necessary to modify the
layers below to add support for different kinds of limits: per partition row
limit, which corresponds to limiting the number of columns returned when
querying a dynamic CF, and limit on the number of partitions returned, so that
we can emulate the one thrift row per key model when querying dynamic CFs.
Ref #399
By default, the schema is marked as compound regardless of the
comparator. Since a composite comparator for static CFs is currently
unsupported (otherwise thrift column families would be
indistinguishable from CQL ones), just mark them as non-compound.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch prevents CQL3 column families from being returned to
clients or subject to updates from thrift.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The get_multi_slice verb is used to perform multiple slices on a
single row key in one operation. It takes a set of column_slices,
which we normalize to not contain any overlapping ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the deoverlap function to range.hh, which takes in a
vector of possibly overlapping ranges and returns a vector of
non-overlapping ranges covering the same values.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The get_paged_slice verb is similar to the get_range_slices verb,
except that it doesn't take a SlicePredicate. Instead, it takes a
column from which to start the query.
For dynamic CFs, we use the partition_slice::specific_ranges to single
out the first partition, and query starting from the start_column row.
For static CFs, we issue an initial query to fetch the remainder of
columns from the first partition, and at least one more query to fetch
the subsequent columns until the limit is reached. This implies a
performance penalty for static CFs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The get_range_slices verb is similar to the multiget_slice verb,
except that it operates on a range of partition keys (or tokens).
In origin, empty partitions are returned as part of the KeySlice, for
which the key will be filled in but the columns vector will be empty.
Since in our case we don't return empty partitions, we don't know which
partition keys in the specified range we should return back to the client.
So for now, our behavior differs from Origin.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the multiget_count verb in a similar fashion as
multiget_slice, but using an accumulator that counts the returned
columns instead of create thrift ColumnOrSuperColumn objects.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch build a query::read_command from a SlicePredicate,
for both dynamic and static column families.
For dynamic CFs, restrictions on the clustering columns are added, and
for static CFs, limits and ordering is defined inline by selecting the
correct regular columns.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support to send a cell's ttl as part of a query's
result. This is needed for thrift support.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the is_dynamic() function to thrift_schema, which
tells whether the underlying column family is dynamic or not,
according to thrift rules.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for composite comparators (which, for dynamic
column families, it means composite clustering keys) and for composite
keys (composite partition keys).
Support for composite column names and regular columns is deferred,
which will entail making compound_type an abstract_type.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series replaces the original scylla-help.py
It contains only a basic script that checks daily for version and report if a
newer version matched.
The script is added as a service and will be started and shutdown with
scylla-server."
Currently, for any column family, we create a directory for it in all
keyspace directories. This is incredibly awkward.
Fix by iterating over just the keyspace's column families, not all
column families in existence.
Fixes#1457.
Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>
The check version script uses the python requests package, this add the
dependency to the ubuntu package.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Ununtu 14.4 upstart does not support timers for recurrent operations.
The upstart cookbook suggest a way to mimic this functionality here:
http://upstart.ubuntu.com/cookbook/#run-a-job-periodically
This patch adds a service that runs the house-keeping daily.
Setting it as a service insure that it would start and stop with
scylla-server service.
filter_for_query() gets sorted by preference list of endpoints and
should preserve that order after filtering out non local endpoints for
local query. partition() does not guaranty this while stable_partition()
does, so use it instead.
Fixes#1450.
Message-Id: <20160713100909.GM10767@scylladb.com>
From Paweł:
This is another episode in the "convert X to streamed mutations" series.
Hashing mutations (mainly for repair) is converted so that it doesn't
need to rebuild whole mutation.
The first part of the series changes the way streamed mutations deal
with range tombstones. Since it is not necessary to make sure we write
disjoint tombstones to sstables there is no need anymore for streamed
mutations to produce disjoint tombstones and, consequently, no need for
range tombstones to be split into range_tombstone_begin and
range_tombstone_end.
The second part is the actual hashing implementation. However, to ensure
that the hash depends only on the contents of the mutation and no the
way it is stored in different data sources range tombstones have to be
made disjoint before they are hashed.
This series also ensures that any changes caused by streamed mutations
to hashing and streaming do not break repair during upgrade.
This patch makes hashing for repair calculate checksums in a way that
doesn't require rebuilding whole mutation.
Unfortunately, such checksums are incompatible with the old ones so the
old way for computing checksums is preserved for compatibility reasons.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The receiving side needs to handle fragmented mutations properly so that
isolation guarantees are not broken. If the receiving node may be an old
one do not fragment mutations.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
mutation_hasher is a consumer of streamed_mutation that feeds its data
to a specified hasher.
It is not compatible with hashing_partition_visitor.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, streamed_mutations guaranteed that emitted tombstones are
disjoint. In order to achieve that two separate objects were produced
for each range tombstone: range_tombstone_begin and range_tombstone_end.
Unfortunately, this forced sstable writer to accumulate all clustering
rows between range_tombstone_begin and range_tombstone_end.
However, since there is no need to write disjoint tombstones to sstables
(see #1153 "Write range tombstones to sstables like Cassandra does") it
is also not necessary for streamed_mutations to produce disjoint range
tombstones.
This patch changes that by making streamed_mutation produce
range_tombstone objects directly.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone::flip() flips range bounds. This is necessary in order
to use range tombstone in reversed mutation fragment streams.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone_accumulator is a helper class that allows determining
tombstone for a clustering row when range tombstones and clustering rows
are streamed from streamed_mutation.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone::apply() allows merging two, possibly overlapping, range
tombstones with the same start bound and produces one or two disjoint
range tombstones as a result.
It is intended to be used for merging tombstones coming from different
sources.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The sstable parsing code calls mp_row_consumer::flush() after every
clustering row has been read, and this puts the now complete row in a single
field "_ready". The assumption is that at this point parsing will stop, the
consumer will move out this _ready (mp_row_consumer::get_mutation_fragment())
and when flush() is later called again, _ready will be empty again.
This assumption is correct in our code, but is based on an intricate
combination of estoreric parts of the code, such as:
1. In data_consume_row_context we stop parsing after reading the parition's
header, before reading any clustering rows, giving the caller the chance
to call sstable_streamed_mutation::read_next() to be prepared for the
incoming mutations.
2. In mp_row_consumer::flush_if_needed(), we stop the parser after each
individual clustering row.
It is easy to break this assumption, and I did this in one of my code changes,
and the result was silent loss of clustering rows, as "_ready" got silently
overwritten before the reader had a chance to move it out.
What this patch does is to add an assertion: If a clustering row is silently
lost before being transferred to the mutation fragment reader, we croak.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468389955-24600-1-git-send-email-nyh@scylladb.com>
This patch fixes a regression introduced in
f81329be60, which made keys compound by
default when using a particular ctor, in turn leading to mismatches
when comparing the same key built with functions that properly
consider compoundness.
As a temporary fix, the sstable::key and sstable::key_view classes
store raw bytes instead of a composite.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468339295-3924-1-git-send-email-duarte@scylladb.com>
Since the timestamp is not serialized, it must always be the last
parameter of query::read_command. This patch reorders it with the
partition_limit parameters and updates callers that specified a
timestamp argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468312334-10623-1-git-send-email-duarte@scylladb.com>
This makes scylla-server to try and start the scylla-housekeeping.
Failing to start the service will not interfere with the scylla-server
start.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
From Duarte:
This patchset adds a representation of a legacy composite
value to compound_compat.hh and replaces the one in
sstables/key.hh. This patchset is needed for the thrift series.
The scylla housekeeping service responsible for recurent tasks.
It is currently set to run daily and report if the version is correct.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
scylla-housekeeping is a script that check and report for hardware and software issues.
The first phase of it check for newer version and report if the version
is old.
To see the available options run
scylla-housekeeping help
We have imported most of our data about config options from Cassandra. Due to
that, many options that mention the database by name are still using
"Cassandra".
Specially for the user visible options, which is something that a user sees, we
should really be using Scylla here.
This patch was created by automatically replacing every occurrence of "Cassandra"
with "Scylla" and then later on discarding the ones in which the change didn't
make sense (such as Unused options and mentions to the Cassandra documentation)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1423e1d7e36874a1f46bd091aec96dcb4d8482d9.1468267193.git.glauber@scylladb.com>
The sstables::key class now delegates much of its functionality
to the composite class. All existing behavior is preserved.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for parsing legacy compound values by
introducing the composite class, a wrapper around a sequence of bytes
serialized in the legacy format for compounds. Compound values can be
sent though the thrift API.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar 9267dfa...e7a7d41 (3):
> Merge "Compression support for RPC" from Gleb
> reactor: allow sleeping while disk aio is pending
> sstring: add resize method
Continuation reordering could cause us to repeatedly see the
segment-local flag var even though actual write/sync ops are done.
Can cause wild recursion without actual delayed continuation ->
SOE.
Fix by also checking queue status, since this is the wait object.
Message-Id: <1468234873-13581-1-git-send-email-calle@scylladb.com>
Continuation reordering could cause us to repeatedly see the
segment-local flag var even though actual write/sync ops are done.
Can cause wild recursion without actual delayed continuation ->
SOE.
Fix by also checking queue status, since this is the wait object.
centos-master jenkins job failed at building libgo, but we don't need go language, so let's disable it on scylla-gcc package.
Also we never use ada, disable it too.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1468166660-23323-1-git-send-email-syuu@scylladb.com>
"Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.
This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.
I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).
Refs #1322."
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.
This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.
I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).
Refs #1322.
memtable_list::seal_on_overlflow() is called on each mutation to check
if current memtable should be flushed. It will call
memtable_list::seal_active_memtable() when that is the case.
The number of concurrent seals is guarded by a semaphore, starting
from commit 0f64eb7e7d, and allows
at most 4 of them.
If there are 4 flushes already pending, every incoming mutation will
enqueue a new flush task on the semaphore's wait list, without waiting
for it. The wait queue can grow without bounds, eventually leading to
out-of-memory.
The fix is to seal the memtable immediately to satisfy should_flush()
condition, but limit concurrency of actual flushes. This way the wait
queue size on the semaphore is limited by memtables pending a flush,
which is fairly limited.
Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>
With so many consumer concepts out there, it is confusing to name
parameters using genering "Consumer" name, let's name them after
(already defined) concepts: CompactedMutationsConsumer, FlattenedConsumer.
Previously, same function was used to handle both regular compaction
and cleanup requests. That's bad because a lot of conditions were
added for both compaction types to live in the same function.
Now, cleanup and regular compaction will live in different functions.
They share a lot of code, so helper functions were introduced.
This change is also important for user-initiated compaction that
will go through compaction manager in the future.
Code is also a lot easier to read now.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There is no longer a need to use gate for regular termination of
fiber that runs compaction. Now, we only set task->stopping to
true, ask for compaction termination, and wait for its future to
resolve. Code is simplified a lot with this change.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Enable --partitioner option so that user can choose partitioner other
than the default Murmur3Partitioner. Currently, only Murmur3Partitioner
and ByteOrderedPartitioner are supported. When non-supported partitioner
is specifed, error will be propogated to user.
In order to support ByteOrderedPartitioner, we need to implement the
missing describe_ownership and midpoint function in
byte_ordered_partitioner class.
As a starter, this path uses a simple node token distance based method
to calculate ownership. C* uses a complicated key samples based method.
We can switch to what C* does later.
Tests are added to tests/partitioner_test.cc.
Fixes#1378
If a node fails to talk to any seed node, shadow round will fail. We
should exit shadow round state before we continue.
This issue is spotted by
consistency_test.TestConsistency.data_query_digest_test dtest.
Message-Id: <ba0613532a69bac369ca316ab61d907b320c8e68.1467963674.git.asias@scylladb.com>
As Nadav notes we use the chunk length as the buffer size for the compressed
stream too.
Fix by using it only for the outer (uncompressed) stream; the inner
(compressed) stream uses the sstable buffer size, 128 kiB.
Fixes#1402.
Message-Id: <1467910556-5759-1-git-send-email-avi@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Support for streaming of large partitions from Paweł:
This series converts streaming to streaming_mutations so that there
is need to store full mutation in memory in order to send or receive
it.
The first several patches add a way of estimating mutation fragment
memory usage and introduce fragment_and_freeze() which produces
a stream of reasonably sized frozen mutations from a single streamed
mutation.
The second part of this patchset makes sure that streaming mutations
in fragments doesn't break isolation guarantees. This is achieved by
delaying visibility of sstables produced by streaming until the
streaming is completed. However, our current receiving code merges
mutations from all streaming plans together thus making it impossible
to track which data was received from a particular streaming plan.
The solution to that problem is to introduce an additional flag to
STREAM_MUTATION verb which informs the receiver whether the mutation
is fragmented and care must be taken to preserve isolation. Small
mutations behaved as they were, with writes from different stream
plans coalesced while big mutations are handled separately for each
streaming task.
Commit 206955e4 "streaming: Reduce memory usage when sending mutations"
moved streaming mutation limiter from do_send_mutations() to
send_mutations(). The reason for that was that send_mutation() did full
mutation copies. That's no longer the case and streaming limiter should
be moved back to do_send_mutation() in order to provide back pressure to
fragment_and_freeze().
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If mutations are fragmented during streaming a special care must be
taken so that isolation guarantees are not broken.
Mutations received with flag "fragmented" set are applied to a memtable
that is used only by that particular streaming task and the sstables
created by flushing such memtables are not made visible until the task
is complte. Also, in case the streaming fails all data is dropped.
This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence separate way of handling
them so that there is no loss of performance for small partitions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.
Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When flush_streaming_mutations() is called at the end of streaming it is
supposed to flush all data and then invalidate cache. ranges However, if
there are already some memtable flushes in progress it won't wait for them.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The purpose of this patch is to split the actions of writing sstable and
sealing it. As long as the sstable is unsealed it is considered
incomplete and is going to be removed on reboot.
Such functionality is needed in order to defer visibility of sstables
created during streaming until the streaming is complete.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
* seastar c82c36f...9267dfa (6):
> app_template: Make run() wait for func when reactor exit is triggered externally
> core: Introduce futurize_apply() helper
> rpc: make unexpected eof messages more informative
> Fix boost version check
> reactor: more fix for smp poll with older boost
> reactor: fix build on older boost due to spsc_queue::read_available()
2a46410f4a changed sstable_list from a map
to a set, so it is no longer sorted by generation. The code for finding
the list of sstables not being compacted relied on this sort order, and
now broke, returning a longer list than needed (including some of the
sstables being compacted). As a result, the compaction code preserved
the tombstones, incorrectly thinking there was still live data they
referenced.
Fix by sorting the set explicitly.
Fixes#1429.
Message-Id: <1467793026-6571-1-git-send-email-avi@scylladb.com>
"After this patchset, date tiered compaction strategy is supported by Scylla.
For those who don't know what it is about, the following article may help:
https://labs.spotify.com/2014/12/18/date-tiered-compaction/
It's also nicely explained here by our wiki page:
https://github.com/scylladb/scylla/wiki/SSTable-compaction#date-tiered-compaction
Basically, date tiered strategy was developed to help the database perform better
when facing a time series workload. Date tiered strategy will work to keep data
written at nearly the same time together, such that the number of relevant sstables
for a time-based query is relatively low. We still lacks support to filter out
sstables based on time parameters of a query, but that feature should come ASAP.
The following dtests now pass:
compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_delete_test
compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test
Used cassandra-stress with the parameter '-schema compaction\(strategy=DateTieredCompactionStrategy\)'
to check stability.
Fixes #511."
"Issue 1195 describes a scenario with a fairly easy reproducer in which
we can freeze the database. That involves writing simultaneously to
multiple CFs, such that the sum of all the memory they are using is larger
than the dirty memory limit, without not any of them individually being
larger than the memtable size.
This patchset rewrites the throttling code, including now active flushes
so that this situation cannot happen.
Fixes#1195"
This commit is basically about converting Java to C++.
Date tiered compaction strategy isn't wired yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Strongly based on org.apache.cassandra.db.compaction.
CompactionController.getFullyExpiredSSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We weren't updating max local deletion time for cells that contain
ttl, or for tombstone cells.
If there is a live cell with no ttl, then max local deletion time
is supposed to store maximum value, which means that the sstable
will not be fully expired later on.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
File can be found at the following C* directory:
src/java/org/apache/cassandra/db/compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Issue 1195 describes a scenario with a fairly easy reproducer in which we can
freeze the database. That involves writing simultaneously to multiple CFs, such
that the sum of all the memory they are using is larger than the dirty memory
limit, without not any of them individually being larger than the memtable size.
Because we will never reach the individual memtable seal size for any of them,
none of them will initiate a flush leading the database to a halt.
The LSA has now gained infrastructure that allow us to be notified when pressure
conditions mount. What we will do in this case is initiate a flush ourselves.
Fixes#1195
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to its own dirty memory manager. Not only will it make sure that
system tables will not be serialized by its own semaphore, but it will also put
system tables in its own region group.
Moving system tables to its own region group has the advantage that system
requests won't be waiting during throttle behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even if under pressure (up to 10 MB more).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We currently have a semaphore in the column family level that protects us against
multiple concurrent sstable flushes. However, storing that semaphore into the CF,
not the database, was a (implementation, not design) mistake.
One comment in particular makes it quite clear:
// Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind
// would take care of the rest. But that still has issues, so we'll limit parallelism to some
// number (4), that we will hopefully reduce to 1 when write behind works.
So I aimed for the shard, but ended up coding it into the CF because that's closer to the
flush point - my bad.
This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore
and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we
don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The
long term benefit is that we now have a one unified structure that can hold shared flush data in all
of the CFs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA memory pressure mechanism will let us know which region is the best
candidate for eviction when under pressure. We need to somehow then translate
region -> memtable -> column family.
The easiest way to convert from region to memtable, is having memtable inherit
from region. Despite the fact that this requires multiple inheritance, which
always raise a flag a bit, the other class we inherit from is
enable_shared_from_this, which has a very simple and well defined interface. So
I think it is worthy for us to do it.
Once we have the memtable, grabing the column family is easy provided we have a
database object. We can grab it from the schema.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA infrastructure, through the use of its region groups, now have
a throttler mechanism built-in. This patch converts the current throttlers
so that the LSA throttler is used instead.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This series contains remaining changes necessary to safely enable read
ahead of sstables. Basically, it makes sure that input_streams are
always properly closed (even in case of exception during read)."
At the moment, it's not possible to know how many compaction are needed for
compaction strategy to be satisfied. It's not possible to know exactly the
number of pending compaction, but the strategy can provide an estimation.
For size tiered, it's based on number of sstables in each bucket. By dividing
bucket size by max threshold, we get number of compaction needed to compact
that single bucket.
For leveled, it's about the number of sstables that exceeds the limit in
each level.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <e209e52f6159ee274a8358b69961a7c0ce357f7d.1467667054.git.raphaelsc@scylladb.com>
Checking features for seed node is a bit more complicated than non-seed
node, because non-seed node can always talk to at least one seed node,
seed node may not.
In this patch, we distingush new cluster and existing cluster by
checking if the system table is empty. We relax the feature check for
new cluster because the feature check is mostly useful when upgrading an
existing cluster to prevent old node to join new cluster.
When talking to a seed node failed during the check, we fallback to the
check using features stored in the system table. This makes restarting a
seed node when no other seed node is up possible (no other seed node at
all, or other seed node is not up yet).
I tested the following scenarios.
1) start a completely new seed node in a new cluster
* system table is empty, skip the check.
2) start a cluster, restart one seed node, at least one other seed node
is up
* system table is not empty, check with shadow round, shadow round will
* succeed
3) start a cluster, restart one seed node, no other seed node is up
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check.
4) start a cluster, shutdown all the nodes, start one seed node with new
ip address, seed list in yaml is updated with new ip address
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check
In 3a36ec33db (gossip: Wait longer for seed node during boot up), we
increased the timeout by the factor of 60, i.e., ring_dealy * 60 = 5
seconds * 60 = 5 minutes.
In 57ee9676c2 (storage_service: Fix default ring_delay time), we fixed
the default ring_dealy to 30 seconds. Now the timeout is 30 * 60 seconds
= 30 minutes, which is too long.
Make it 5 minues.
If someone tried to naively use utf8_type->decompose("18wX"), this would
mysteriously fail, returning an empty key.
decompose takes a data_value, so the compiler looked for an implict
conversion from the string constant (const char*) to data_value. We did
not have such a conversion, only conversion from sstring. But the compiler
chose (backed by the C++ standard, no doubt) to implicitly convert the
const char* to a bool (!), and then use data_value(bool). It did not
convert the const char* to an sstring, nor did it warn about the possible
ambiguity.
So this patch adds a data_value(const char*) constructor, so people will
not fall into the same trap that I fell into...
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1467643462-6349-1-git-send-email-nyh@scylladb.com>
In a leveled column family, there can be many thousands of sstables, since
each sstable is limited to a relatively small size (160M by default).
With the current approach of reading from all sstables in parallel, cpu
quickly becomes a bottleneck as we need to check the bloom filter for each
of these sstables.
This patch addresses the problem by introducing a
compaction-strategy-specific data structure for holding sstables. This
data structure has a method to obtain the sstables used for a read.
For leveled compaction strategy, this data structure is an interval map,
which can be efficiently used to select the right sstables.
When a seed node boots up with more than one node in the seed list, it
will fail to talk to the other seed node which is not up yet.
This fails the feature check, so the seed node will not boot.
Skip the feature check for seed node for now, util we have a proper solution.
Fixes recent dtest failure due to fail to boot the seed node.
Message-Id: <e1d4110f96817e45f81dc0bc948dd14600fc5333.1467251799.git.asias@scylladb.com>
"This series converts sstable writers (including compaction) to streamed
mutations and makes them use consumer-style interface.
Code related to sstable writes and compaction is converted to consumers
that can be used with consume_flattened_in_thread() (which is a variant
of consume_flattened() intended to be run inside a thread).
compac_for_query is improved so that it can be reused by sstable
compaction."
Amnon says:
The API that returns the version, currently returns the compatibility
version
(e.g. the version the compatible origin version - currently 2.1.8).
The check version functionality need to know what is the current running
version of scylla. For that a new API was added that return the current
version.
The result is equivalent of running scylla --version.
After this series a call to:
$ curl -X GET
"http://localhost:10000/storage_service/scylla_release_version"
"666.development-20160703.72f0d4d"
Which is the json representation of:
$ ./build/release/scylla --version
666.development-20160703.72f0d4d
We currently log as follow:
May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again
Howerver, user should use
override_decommission:true
instead of
cassandra.override_decommission:true
in scylla.yaml where the cassandra prefix is stripped.
Fixes#1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
data_resource lookup uses data_resource::name(), which uses sprint(), which
uses (indirectly) locale, which takes a global lock. This is a bottleneck
on large machines.
Fix by not using name() during lookup.
Fixes#1419
Message-Id: <1467616296-17645-1-git-send-email-avi@scylladb.com>
This adds a definition to the scylla release version. The API already
return the compatibility version (ie. the compatible origin version)
This definition returns the scylla version, a call to the API should
return the same result as running scylla --version.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Apply compaction strategy specific logic to narrow down the set of sstables
used for a query; can speed up reads using LeveledCompactionStrategy
significantly.
Fixes#1185.
Using sstable_set will allow us to filter sstables during a query before
actually creating a reader (this is left to the next patch; here we just
convert the users of the _sstables field).
Allow compaction_strategy to create a container for sstables that is
optimized for the strategy.
Most compaction_strategies return bag_sstable_set; leveled compaction
returns the specialized partitioned_sstable_set.
partitioned_sstable_set assumes that sstable are mostly partitioned along
the token range: only a few sstables will be needed to access a particular
token. It is implemented as an interval_map.
bag_sstable_set is a generic sstable_set implementation: it assumes nothing
about the sstables. It is implemented as a vector, and any select will
return the entire sstable set.
sstable_set abstracts the notion of a container of sstables, allowing
different compaction strategies to supply their own implementation. The
intended user is leveled compaction strategy; since it partitions sstables,
it can quickly restrict the number of sstables that participate in a query
by looking at the min/max partition key.
sstable_set also maintains an internal lw_shared_ptr<sstable_list>,
in parallel with the abstract container. This is to support
column_family::get_sstable(), which returns a lw_shared_ptr<sstable_list>
which must be anchored somewhere if it is not saved at the caller side,
as it isn't in most current callers.
ring_position is built for modern code that does not require default
constructors or stateless comparators. But not all code is modern, so
supply a compatible_ring_position that works with old code, at the cost
of some extra storage. Intended user is boost's interval container
library.
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
If read ahead is going to be enabled it is important to close
input_stream<> properly (and wait for completion) before destroying it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch moves compaction logic to a consumer that can be used with
consume_flattened_in_thread(). Internally, sstable_writer is used to
write individual sstables.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable_writer encapsulates all logic related to writing sstable.
Previously introduced component_writer is used to write actual
mutations. sstable_writer is intended to be used with
consume_flattened_in_thread(). Its purpose is to be used by higher-level
consumer that needs to write possibly more than one sstable (sstable
compaction is an example of such consumer).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch rewrites do_write_components() so that it can use
consume_flattened_in_thread(). All components-writing code is moved to a
new consumer: component_writer.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This is a version of consume_flattened() intended to be run inside a
thread. All consumer code is going to be invoked in the same thread
context.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
compact_mutation code is going to be shared among queries and sstable
compaction. There are some differences though. Queries don't provide
_max_purgeable and sstable compaction don't need any limits.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
_max_perguable can be different for each partition, since it is computed
using sstables in which that partition is present (or likely to be
present).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Since decorated keys are already computed it is better to pass more
information than less. Consumers interested just in partition key can
just drop token and the ones requiring full decorated key don't need to
recompute it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
query::full_slice is a partiton slice which has full clustering row
ranges for all partition keys and no per-partition row limit.
Options and columns are not set.
It is used as a helper object in cases when a reference to
partition_slice is needed but the user code needs just all data there is
(an example of such case would be sstable compaction).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently, each sstable write has its separate thread. However, the goal
is to have compaction use consume_flattened() with a consumer that
creates and writes the sstables. consume_flattened() needs to be executed
inside a thread, since sstable writer may defer.
This patch is a first step in preparations and it just makes whole
compaction logic run inside a thread. That makes little sense now, since
all sstable writes spawn their own threads but that's going to change
in the following patches.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The partition_limit should have been added to the end of the ctor
argument list, as its current placement causes some callers to pass it
the timestamp instead of the limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1467239360-6853-3-git-send-email-duarte@scylladb.com>
While limiting the number of concurrently executing sstable readers reduces
our memory load, the queued readers, although consuming a small amount of
memory, can still grow without bounds.
To limit the damage, add two limits on the queue:
- a timeout, which is equal to the read timeout
- a queue length limit, which is equal to 2% of the shard memory divided
by an estimate of the queued request size (1kb)
Together, these limits bound the amount of memory needed by queued disk
requests in case the disk can't keep up.
Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>
At the moment, we only trigger compaction after creating a new
sstable as a result of memtable flush, or some other event such
as changing compaction strategy of a column family.
However, it's important to trigger compaction on boot too.
That will happen after loading all column families.
Fixes#1404.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <54d38a418157454eec97aaba6b8a6b6e51484db4.1467135349.git.raphaelsc@scylladb.com>
The scylla-jmx no longer shutdown itself. A better setup would be that
the it would be started when the scylla-server starts and that it would
shutdown when the scylla-server shutdown.
This patch do the scylla-server part of the change.
The scylla-server definition would Want the scylla-jmx.service so there
is no need to enable the scylla-jmx.service.
A patch to the scylla-jmx would cause it to shutdown when the scylla-jmx
shutsdown.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1467184502-4358-1-git-send-email-amnon@scylladb.com>
Following Cassandra, our default sstable compression chunk size is 64 KB.
The big downside of this default size is that small reads need to read
and uncompress a large chunk, around 32 KB (if compression halves the data
size). In this patch we switch the default chunk size to 4 KB, which allows
faster small reads (the report in issue #1337 was of a 60-fold speedup...).
Since commit 2f56577, large reads will not be signficantly slowed down by
changing to a small chunk size. The remaining potential downside of this
change is lowering of the compression ratio because of the smaller chunks
individually compressed. However, experimentation shows that the compression
ratio is hurt somewhat, but not dramatically, by lowering the chunk size:
A recent survey of Cassandra compression in
https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/
reports a compression ratio of 2 for 64 KB chunks, vs. 1.75 for 4 KB chunks.
My own test on a cassandra-stress workload (whose data is relatively hard
to compress), showed compression ratio 1.25 for 64 KB chunk, vs. 1.23 for
4 KB chunks.
Also remember that if a user wants to control the chunk length for a
particular table, he can - the 64 KB or 4 KB sizes are just the default.
Fixes#1337
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1467063335-12096-1-git-send-email-nyh@scylladb.com>
Helps in identifying pointers allocated through seastar
allocator. Shows to which thread the pointer belongs, to which size
class, whether it's live or free, what's the offset realtive to the
live object.
Example:
(gdb) scylla ptr 0x6040abe88170
thread 1, small (size <= 320), live (0x6040abe88140 +48)
Message-Id: <1467047215-1763-1-git-send-email-tgrabiec@scylladb.com>
* seastar 3029ebe...c15055c (5):
> memory: add option to mlock() all memory
> reactor: run idle poll handler with a pure poll function
> ignore all but one failed futures in map_reduce
> tutorial: more general exception printout on startup
> resource: don't abort on too-high io queue count
Fixes#1395.
Fixes#1400.
From Avi:
Both the cql binary transport and the rpc server have protection against
too many concurrent requests overwhelming the database due to transient
allocations. There work by estimating the amount of memory a request
requires, and accounting that against a semaphore. When the semaphore
blocks, we stop dequeing requests from the tcp connection.
Unfortunately, this doesn't work for reads, because we can't estimate the
required memory size. A small read request can require many sstables to be
read, perhaps concurrently, and a large response to be generated.
Fix by limiting the number of concurrent reads in a shard to 100. This
is more than enough concurrency for any reasonable disk, and there is no
network communication at this level, so we're safe from high network
latency requiring high concurrency.
Fixes#1398.
Since reading mutations can consume a large amount of memory, which, moreover,
is not predicatable at the time the read is initiated, restrict the number
of reads to 100 per shard. This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.
Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled. This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.
Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads. Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
A restricting_reader wraps a mutation_reader, and restricts it concurrency
using a provided semaphore; this allows controlling read concurrency, which
is important since reads can consume a lot of resources ((number of
participating sstables) * 128k after we have streaming mutations, and a lot
more before).
This patch adds support for thrift prepared statements. It specializes
the result_message::prepared into two types:
result_message::prepared::cql and result_message::prepared::thrift, as
their identifiers have different types.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
query_options::prepare() changes the values array, but this is not the
one used by query_options internally (e.g., in get_value_at). So we
need to also recalculate the value_views after prepare() is called.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Having both the values and value_views arguments in the query_options
ctor is confusing, since query_options uses only the value_views field
but that is not communicated to the caller.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Similarly to the with_cob functions, this one takes the exn_cob
function and ensures it is called in case of an exception. This
is useful when the return type of the thrift verb is not nothrow
move constructible; by holding on to the cob inside the verb and
calling it directly when we have the result we avoid having to
wrap it in a smart pointer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gcc 6 complains that deleting a managed_bytes::external isn't defined
because the size isn't known. I'm not sure it's correct, but there's no
way to tell because flexible arrays aren't standardized.
Fix by using an array of zero size.
Message-Id: <1466715187-4125-1-git-send-email-avi@scylladb.com>
Using a template lambda invokes a bug in Fedora 24's boost where the
lambda's parameter is an internal boost type rather than a range_tombestone.
Constraining the parameter with an explicit type avoids the problem.
Message-Id: <1466844211-17298-1-git-send-email-avi@scylladb.com>
This adds to the definition of the collectd API the ability to turn on
and off specific collectd metrics.
For the GET end point a POST option was added that allow to enable or
disable a metric.
The general GET endpoint now returns the enable flag that indicates if
the metric is enable.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1466932139-19264-2-git-send-email-amnon@scylladb.com>
"This patchset implements the thrib describe verbs:
- describe_keyspace
- describe_keyspaces
- describe_cluster_name
- describe_version
- describe_ring
- describe_local_ring
- describe_token_map
- describe_partitioner
- describe_snitch
- describe_schema_versions
The verbs describe_splits and describe_splits_ex are not implemented
because they are marked as experimentail (Origin's thrift interface has
this to say about them: "experimental API for hadoop/parallel query
support. may change violently and without warning."). Some drivers have
moved away from depending on this verb (SPARKC-94). The correct way to
implement the verbs for us would be to use the size_estimates system table
(CASSANDRA-7688). However, we currently don't populate size_estimates, which
is done by SizeEstimatesRecorder.java in Origin."
This patch removes a conversion function from an internal type
name to Origin's naming, which isn't needed because the
abstract_type hierarchy already keeps that mapping.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We don't implement describe_splits, and this patch describes why that
it. In a nutshell, to properly implement this, we would need something
like Origin's SizeEstimatesRecorder.java, but as the verb is marked as
experimental, we don't do it for now.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch completes the system_add_keyspace verb by setting all
relevant options on the new schemas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In thrift, a static column family is one where all columns are
defined upon schema creation. It maps to a CQL table with a singular
partition key and a set of regular columns.
On the other hand, a dynamic column family is one which allows column
to be dynamically added by insertion requests. It maps to a CQL table
with a partition key and a clustering key, which will hold the names of
the dynamic columns, and a regular column, which will how the respective
values. If the thrift comparator type is composite, then there will be a
clustering column for each of the composite's components.
There can also be mixed column families; supporting those is future
work.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the make_exception function from thrift/handler.cc to
the new header file thrift/utils.hh.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We currently have a problem in update_cache, that can be trigger by ordering
issues related to memtable flush termination (not initiation) and/or
update_cache() call duration.
That issue is described in #1364, and in short, happens if a call to
update_cache starts before and ongoing call finishes. There is now a new SSTable
that should be consulted by the presence checker that is not.
The partition checker operates in a stale list because we need to make sure the
SSTable we just wrote is excluded from it. This patch changes the partition
checker so that all SSTables currently in use are consulted, except for the one
we have just flushed. That provides both the guarantee that we won't check our
own SSTable and access to the most up-to-date SSTable list.
Fixes#1364
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>
Add contiguity flag to cache entry and set it in scanning reader.
Partitions fetched during scanning are continuous
and we know there's nothing between them.
Clear contiguity flag on cache entries
when the succeeding entry is removed.
Use continuous flag in range queries.
Don't go do disk if we know that there's nothing
between two entries we have in cache. We know that
when continuous flag of the first one is set to true.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <72bae432717037e95d1ac9465deaccfa7c7da707.1466627603.git.piotr@scylladb.com>
If we don't, std::terminate() causes a core dump, even though an
exception is sort-of-expected here and can be handled.
Add an exception handler to fix.
Fixes#1379.
Message-Id: <1466595221-20358-1-git-send-email-avi@scylladb.com>
gdb gets confused if a non-fully-qualified class name is used when
we are in some namespace context. Help it out by adding a :: prefix.
Message-Id: <1466587895-8690-1-git-send-email-avi@scylladb.com>
This patchset adds two new types of query limits:
- Per partition row limit, which limits how many rows
a given partition may return; needed both for thrift
and for future CQL features;
- Limit on the number of partitions returned, needed
by thrift.
This patch renames compact_query::_partition_limit to
_current_partition_limit for clarity, as the next patch adds
a partition limit that limits the number of partitions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch renames compact_query::_limit to _row_limit for
clarity, as a subsequent patch introduces yet another limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch as a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the way we fetch each replica's last row to
determine if we got incomplete information from any of them. Instead
of fetching the last rows up front, we fetch them on demand only if we
actually trigger the code that needs them. We now get the last row from
the versions vector of vectors.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts to a function the code that actually determines
the last row of a partition based on the direction of the query.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since systemd moved to PermissionsStartOnly, only upstart uses sudoers.
So move common/sudoers.d to dist/ubuntu, drop them from .rpm.
Also, Ubuntu 15.10/16.04 does not requires sudoers since these are uses systemd.
So copy sudoers only for 14.04.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1466536491-9860-1-git-send-email-syuu@scylladb.com>
* seastar 401c333...3029ebe (3):
> util: add a seastar::value_of() helper function
> rpc: force closing listen fd on server stop
> reactor: fix I/O priority class id assignment
From Paweł:
This series introduces streaming_mutations which allow mutations to be
streamed between the producers and the consumers as a series of
mutation_fragments. Because of that the mutation streaming interface
works well with partitions larger than available memory provided that
actual producer and consumer implementations can support this as well.
mutation_fragments are the basic objects that are emitted by
streamed_mutations they can represent a static row, a clustering row,
the beginning and the end of a range tombstone. They are ordered by their
clustering keys (with static rows being always the first emitted mutation
fragment). The beginning of range tombstone is emitted before any
clustering row affected by that tombstone and the end of range tombstone
is emitted after the last clustering row affected by it. Range tombstones
are disjoint.
In this series all producers are converted to fully support the new
interface, that includes cache, memtables and sstables. Mutation queries
and data queries are the only consumers converted so far.
To minimize the per-mutation_fragment overhead streamed_mutations use
batching. The actual producer implementation fills a buffer until
it is full (currently, buffer size is 16, the limit should, however,
be changed to depend on the actual size in memory of the stored elements)
or end of stream is reached.
In order to guarantee isolation of writes reads from cache and memtable
use MVCC. When a reader is created it takes a snapshot of the particular
cache or memtable entry. The snapshot is immutable and if there happen
to be any incoming writes while the read is active a new version of
partition is created. When the snapshot is destroyed partition versions
are merged together as much as possible.
Performance results with perf_simple_query (median of results with
duration 15):
before after diff
write 618652.70 618047.58 -0.10%
read 661712.44 608070.49 -8.11%
This reverts commit 2d7f8f4a47.
Avi sayeth:
"Isn't this the other way round? EBS is persistent."
and
"The patch is wrong too. Instance store takes 5 minutes to boot
compared to 1 minute for EBS."
From Glauber:
This is my new take at the "Move throttler to the LSA" series, except
this one don't actually move anything anywhere: I am leaving all
memtable conversion out, and instead I am sending just the LSA bits +
LSA active reclaim. This should help us see where we are going, and
then we can discuss all memtable changes in a series on its own,
logically separated (and hopefully already integrated with virtual
dirty).
[tgrabiec: trivial merge conflicts in logalloc.cc]
If a CF does not have any sstables at all, we should treat it
as having a replay position of zero. However, since we also
must deal with potential re-sharding, we cannot just set
shard->uuid->zero initially, because we don't know what shards
existed.
Go through all CF:s post map-reduce, and for every shard where
a CF does not have an RP-mapping (no sstables found), set the
global min pos (for shard) to zero.
Fixes#1372
Message-Id: <1465991864-4211-1-git-send-email-calle@scylladb.com>
Try to emulate the origin behaviour for batch reply. They use an
explicit write handler, combinging
1.) Hinting to all known dead endpoints
2.) Sending to all persumed live, requiring ack from all
3.) Hinting to endpoint to which send failed.
We don't have hints, so try to work around by doing send with
cl=ALL, and if send fails (wholly or partially), retain the
batch in the log.
This is still slight behavioural difference, and we also risk
filling up the batch log in extreme cases. (Though probably not
in any real environment).
Refs #1222
Message-Id: <1466444170-23797-1-git-send-email-calle@scylladb.com>
We now keep the regions sorted by size, and the children region groups as well.
Internally, the LSA has all information it needs to make size-based reclaim
decisions. However, we don't do reclaim internally, but rather warn our user
that a pressure situation is mounted.
The user of a region_group doesn't need to evict the largest region in case of
pressure and is free to do whatever it chooses - including nothing. But more
likely than not, taking into account which region is the largest makes sense.
This patch puts together this last missing piece of the puzzle, and exports the
information we have internally to the user.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Region is implemented using the pimpl pattern (region_impl), and all its
relevant data is present in a private structure instead of the region itself.
That private structure is the one that the other parts of the LSA will refer to,
the region_group being the prime example. To allow classes such as the
region_group the externally export a particular region, we will introduce a
backpointer region_impl -> region.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are currently just allowing the region_group to specify a throttle_threshold,
that triggers throttling when a certain amount of memory is reached. We would
like to notify the callers that such condition is reached, so that the callers
can do something to alleviate it - like triggering flushes of their structures.
The approach we are taking here is to pass a reclaimer instance. Any user of a
region_group can specialize its methods start_reclaiming and stop_reclaiming
that will be called when the region_group becomes under pressure or ceases to
be, respectively.
Now that we have such facility, it makes more sense to move the
throttle_threshold here than having it separately.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we decide to evict from a specific region_group due to excessive memory
usage, we must also consider looking at each of their children (subgroups). It
could very well be that most of memory is used by one of the subgroups, and
we'll have to evict from there.
We also want to make sure we are evicting from the biggest region of all, and
not the biggest region in the biggest region_group. To understand why this is
important, consider the case in which the regions are memtables associated with
dirty region groups. It could be that a very big memtable was recently flushed,
and a fairly small one took its place. That region group is still quite large
because the memtable hasn't finished flushing yet, but that doesn't mean we
should evict from it.
To allow us to efficiently pick which region is the largest, each root of each
subtree will keep track of its maximal score, defined as the maximum between our
largest region total_space and the maximum maximal score of subtrees.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, the regions in a region group are organized in a simple vector.
We can do better by using a binomial heap, as we do for segments, and then
updating when there is change. Internally to the LSA, we are in good position
to always know when change happens, so that's really the best way to do it.
The end game here, is to easily call for the reclaim of the largest offending
region (potentially asynchronously). Because of that, we aren't really interested
in the region occupancy, but in the region reclaimable occuppancy instead: that's
simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing
that effectively lists all non reclaimable regions in the end of the heap, in no
particular order.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The database code uses a throttling function to make sure that memory
used for the dirty region never is over the limit. We track that with
a region group, so it makes sense to move this as generic functionality
into LSA.
This patch implements the LSA-side functionality and a later patch will
convert the current memtable throttler to use it.
Unlike the current throttling mechanism, we'll not use a timer-based
mechanism here. Aside from being more generic and friendlier towards
other users, this is a good change for current memtable by itself.
The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we
would be better off without them. Let's discuss the merits of each separately:
1) 10ms timer: If we are throttling, we expect somebody to flush the memtables
for memory to be released. Since we are in position to know exactly when a memtable
was written, thus releasing memory, we can just call unthrottle at that point, instead
of using a timer.
2) 1MB release threshold: we do that because we have no idea how much memory a request
will use, so we put the cut somehow. However, because of 1) we don't call unthrottle
through a timer anymore, and do it directly instead. This means that we can just execute
the request and see how much memory it has used, with no need to guess. So we'll call
unthrottle at the end of every request that was previously throttled.
Writing the code this way also has the advantage that we need one less continuation in
the common case of the database not being throttled.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
compact_for_query is an intermediate stage used to compact data in a
flattened stream of mutations before they are consumed by query building
consumers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Mutation reader produces a stream of streamed_mutations. Each
streamed_mutation itself is a stream so basically we are dealing here
with a stream of streams.
consume_flattened() flattens such stream of streams making all its
elements consumable by a single consumer. It also allows reversing
the mutations before consumption using reverse_streamed_mutation().
reverse_streamed_mutation() is an inefficient way of reversing
streamed_mutations. First, it collects all mutation_fragments and then
it emits them in the reversed orders (except static row which always is
the first element and it also flips the bounds of range tombstones).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
flip_bound_kind() changes start bound to end bound and vice versa while
preserving the inclusivness.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
To ensure isolation of operation when streaming a mutation from a
mutable source (such as cache or memtable) MVCC is used.
Each entry in memtable or cache is actually a list of used versions of
that entry. Incoming writes are either applied directly to the last
verion (if it wasn't being read by anyone) or preprended to the list
(if the former head was being read by someone). When reader finishes it
tries to squash versions together provided there is no other reader that
could prevent this.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, ranges for reversed queries were in descending order and
ranges for forward queries in ascending order. However,
streamed_mutations require them to always be in ascending order.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The main user of this list is MVCC implementation in partition_version.cc.
The reason why boost::intrusive::list<> cannot be used is that tere is no
single owner of the list who could keep boost::intrusive::list<> object
alive. In the MVCC case there is at least one partition_entry object and
possibly multiple partition_snapshot objects which lifetime is independent
and the list must remain in a valid state as long as at least one of them
is alive.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
It is incorrect to update row_cache with a memtable that is also its
underlying storage. The reason for that is that after memtable is merged
into row_cache they share lsa region. Then when there is a cache miss
it asks underlying storage for data. This will result with memtable
reader running under row_cache allocation section. Since memtable reader
also uses allocation section the result is an assertion fault since
allocation sections from the same lsa region cannot be nested.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
With streamed_mutations a partition with many small rows doesn't stress
the cache as much as the test expects. Use large clustering rows instead.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone_stream encapsulates logic responsible for turning
range_tombstone_list into a stream of mutation_fragments and merging
that stream with a stream of clustering rows.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Row markers and collections weren't filtered out even if they belonged
to a clustering row that shouldn't be in the result. The check whether
to include cell or not was done only for live and dead atomic cells.
This patch adds appropriate checks for collections and row markers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
mutation_reader and streamed_mutation may use the same stream as a source
mutation_fragments and mutations themselves (this happens in sstable reader).
In such case asking for next streamed_mutation from mutation_reader would
invalidate all other streamed_mutations.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
streamed_mutation represents a mutation in a form of a stream of
mutation_fragments. streamed_mutation emits mutation fragments in the
order they should appear in the sstables, i.e. static row is always
the first one, then clustering rows and range tombstones are emitted
according to the lexicographical ordering of their clustering keys and
bounds of the range tombstones.
Range tombstones are disjoint, i.e. after emitting
range_tombstone_begin it is guaranteed that there is going to be a
single range_tombstone_end before another range_tombstone_begin is
emitted.
The ordering of mutation_fragments also guarantees that by the time
the consumer sees a clustering row it has already received all
relevant tombstones.
Partition key and partition tombstone are not streamed and is part of
the streamed_mutation itself.
streamed_mutation uses batching. The mutation implementations are
supposed to fill a buffer with mutation fragments until is_buffer_full()
or the end of stream is encountered.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This commit introduces mutation_fragment class which represents the parts
of mutation streamed by streamed_mutation.
mutation_fragment can be:
- a static row (only one in the mutation)
- a clustering row
- start of range tombstone
- end of range rombstone
There is an ordering (implemented in position_in_partition class) between
mutation_fragment objects. It reflects the order in which content of
partition appears in the sstables.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
After commit faa4581, each shard only starts splitting its shared sstables
after opening all sstables. This was important because compaction needs to
be aware of all sstables.
However, another bug remained: If one shard finishes loading its sstables
and starts the splitting compactions, and in parallel a different shard is
still opening sstables - the second shard might find a half-written sstable
being written by the first shard, and abort on a malformed sstable.
So in this patch we start the shared sstable rewrites - on all shards -
only after all shards finished loading their sstables. Doing this is easy,
because main.cc already contains a list of sequential steps where each
uses invoke_on_all() to make sure the step completes on all shards before
continuing to the next step.
Fixes#1371
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>
Using an ordering mechanism better than rw-locks for write/flush
means we can wait for pending write in batch mode, and coalesce
data from more than one mutation into a chunk.
It also means we can wait for a specific read+flush pair (based on
file position).
Downside is that we will not do parallel writes in batch mode (unless
we run out of buffer), which might underutilize the disk bandwidth.
Upside is that running in batch mode (i.e. per-write consistency)
now has way better bandwidth, and also, at least with high mutation
rate, better average latency.
Message-Id: <1465990064-2258-1-git-send-email-calle@scylladb.com>
Scylla will not start if the disk was not benchmarked
so start run io_tune with the right parameters.
Also add the cpu_set environment variables for passing
cpu set to iotune and scylla.
Signed-of-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466412846-4760-2-git-send-email-benoit@scylladb.com>
Use the PermissionsStartOnly systemd option to apply the permission
related configurations only to the start command. This allows us to stop
using "sudo" for ExecStartPre and ExecStopPost hooks and drop the
"requiretty" /etc/sudoers hack from Scylla's RPM.
Tested-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1466407587-31734-1-git-send-email-penberg@scylladb.com>
Because we build on CentOS 7, which does not have the %sysctl_apply macro,
the macro is not expanded, and therefore executed incorrectly even on 7.2,
which does.
Fix by expanding the macro manually.
Fixes#1360.
Message-Id: <1466250006-19476-1-git-send-email-avi@scylladb.com>
Since most of the time people are running scylla_setup on
a fully upgraded ubuntu 14.04 box, we rarely reach that
code path, but once we do we end up with an error. Let's
fix that.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Message-Id: <1466205496-3885-2-git-send-email-lmr@scylladb.com>
Starting in commit 721f7d1d4f, we start "rewriting" a shared sstable (i.e.,
splitting it into individual shards) as soon as it is loaded in each shard.
However as discovered in issue #1366, this is too soon: Our compaction
process relies in several places that compaction is only done after all
the sstables of the same CF have been loaded. One example is that we
need to know the content of the other sstables to decide which tombstones
we can expire (this is issue #1366). Another example is that we use the
last generation number we are aware of to decide the number of the next
compaction output - and this is wrong before we saw all sstables.
So with this patch, while loading sstables we only make a list of shared
sstables which need to be rewritten - and the actual rewrite is only started
when we finish reading all the sstables for this CF. We need to do this in
two cases: reboot (when we load all the existing sstables we find on disk),
and nodetool referesh (when we import a set of new sstables).
Fixes#1366.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>
Commit daad2eb "row_cache: fix memory leak in case of schema upgrade
failure" has fixed a memory leak caused by failed upgrade_entry().
However, in case of upgrade failure memtable_entry used to create the
new cache entry was left in some invalid state. If the operation was
retried the cache would attempt again to apply that memtable_entry which
now would be in invalid state.
The solution is to either to ignore upgrade_entry() exceptions or do not
call it at all and let the cache entry be upgraded on demand. This patch
implements the latter.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1466163435-27367-1-git-send-email-pdziepak@scylladb.com>
When update() causes a new entry to be inserted to the cache the
procedure is as follows:
1. allocate and construct new entry
2. upgrade entry schema
3. add entry to lru list and cache tree
Step 2 may fail and at this point the pointer to the entry is neither
protected by RAII nor added in any of the cache containers. The solution
is to swap steps 2 and 3 so that even if the upgrade fails the entry is
already owned by the cache and won't leak.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1466161709-25288-1-git-send-email-pdziepak@scylladb.com>
We want to prevent older version of scylla which has fewer features to
join a cluster with newer version of scylla which has more features,
because when scylla sees a feature is enabled on all other nodes, it
will start to use the feature and assume existing nodes and future nodes
will always have this feature.
In order to support downgrade during rolling upgrade, we need to support
mixed old and new nodes case.
1) All old nodes
O O O O O <- N OK
O O O O O <- O OK
2) All new nodes
N N N N N <- N OK
N N N N N <- O FAIL
3) Mixed old and new nodes
O N O N O <- N OK
O N O N O <- O OK
(O == old node, N == new node, <- == joining the cluster)
With this patch, I tested:
1.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
1.2) Add old node to old node cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
2.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
2.2) Add old node to new node cluster
seastar - Exiting on unhandled exception: std::runtime_error (Feature
check failed. This node can not join the cluster because it does not
understand the feature. Local node 127.0.0.4 features = {}, Remote
common_features = {RANGE_TOMBSTONES})
3.1) Add new node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {}
3.2) Add old node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
Fixes#1253
If the feature string is empty, boost::split will return
std::set<sstring> = {""} instead of std::set<sstring> = {}
which will make a node with a feaure, e.g. std::set<sstring> =
{"RANGE_TOMBSTONES"}, think it does not understand the feature of
a node with no features at all.
User may specify time after which speculative retry should happen
instead of relying on cf statics. Use provided value in speculative
executor.
Message-Id: <20160616104422.GH5961@scylladb.com>
Currently, we only stop the CQL transport server. Extract a
stop_transport() function from drain_on_shutdown() and call it from
do_isolate_on_error() to also shut down the inter-node RPC transport,
Thrift, and other communications services.
Fixes#1353
"Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.
I saw no difference in throughput in a CQL write micro benchmark
in neither of these workloads:
- full segments, reclaim by random eviction
- sparse segments (3% occupancy), reclaim by compaction and no eviction
Fixes #1274."
Commit f42673ed1e ("scylla_setup: Hide
busy block devices from RAID0 configuration") wasn't enumerating
anything. Additionally it listed from /dev/ and not /dev/dm which broke
the tests conditions.
This one uses blkid instead of /proc/partitions.
A follow up patch will be required to mask encrypted devices.
Signed-of-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466059657-12377-1-git-send-email-benoit@scylladb.com>
Allocations will still be allowed if made directly, but callers will have the
choice (in an upcoming patch) to proceed only if memory is below this threshold.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Tomek correctly points out that since we are now using "this" in lambda
captures, we should make the region_group not movable. We currently define a
move constructor, but there are no users. So we should just remove them.
copy constructor is already deleted, and so are the copy and move assignment
operators. So by removing the move constructor, we should be fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch look in /proc/mount for the device name so
the device or it's subdevices will be excluded from the availables
RAID0 targets. It does the same with physical volume from device
mapper.
Fixes#1189
Message-Id: <1466001423-9547-4-git-send-email-benoit@scylladb.com>
is_atomic() is called for each cell in mutation applies, compaction
and query. Since the value doesn't change it can be easily cached which
would save one indirection and virtual call.
Results of perf_simple_query -c1 (median, duration 60):
before after
read 54611.49 55396.01 +1.44%
write 65378.92 68554.25 +4.86%
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1465991045-11140-1-git-send-email-pdziepak@scylladb.com>
On NUMA hardware, autonuma may reduce performance by
unmapping memory.
Since we do manual NUMA placement, autonuma will not
help anything.
We ought to disable it by setting the kernel.numa_balancing
sysctl to 0.
Fixes: #1120
Signed-of-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466006345-9972-1-git-send-email-benoit@scylladb.com>
dtest takes error level log as serious error. It is not a serious error
for streaming to fail to send a verb and fail a streaming session which
triggers a repair failure, for example, the peer node is gone or
stopped. Switch to use log level warn instead of level error.
Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test
Fixes: #1335
Message-Id: <406fb0c4a45b81bd9c0aea2a898d7ca0787b23e9.1465979288.git.asias@scylladb.com>
dtest takes error level log as serious error. It is not a serious error
for streaming to fail to send a verb and fail a streaming session, for
example, the peer node is gone or stopped. Switch to use log level warn
instead of level error.
Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test
Fixes: #1335
Message-Id: <0149d30044e6e4d80732f1a20cd20593de489fc8.1465979288.git.asias@scylladb.com>
Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.
I saw no decrease in throughput compared to the step of 16 segments in
neither of these modes:
- full segments, reclaim by random evicition
- sparse segments (3% occupancy), reclaim by compaction and no eviction
Fixes#1274.
push_back() is not reentrant with pop_front(), used by the evictor. If
reclaimer runs when std::deque allocates a new node it will get
corrupted. Fix by runnning push_back() under reclaim lock.
tracing::tracing local instance is dereferenced from a
cql_server::connection::process_request(), therefore tracing::tracing
service may be stop()ed only after a CQL server service is down.
On the other hand it may not be stopped before RPC service is down
because a remote side may request a tracing for a specific command too.
This patch splits the tracing::tracing stop() into two phases:
1) Flush all pending tracing records and stop the backend.
2) Stop the service.
The first phase is called after CQL server is down and before RPC is down.
The second phase is called after RPC is down.
Fixes#1339
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465840496-19990-1-git-send-email-vladz@cloudius-systems.com>
* seastar 864d6dc...401c333 (8):
> scollectd: Support filtering specific collectd metrics
> core: Integrate error reporting with the logging framework
> rpc: wait for all replies to be completed before closing rpc server
> rpc: clean up resource accounting
> queue: fix race between pop_eventually() and abort()
> rpc_test: fix cancel test to not depend on timing.
> tutorial: explain application-specific command line options
> add ostream output operator for std::unordered_map
When read repair writes diffs back to replicas it is enough to wait
for requested CL to guaranty read monotonicity. This patch makes read
repair write reuse regular mutate functionality which already tracks
CL status. This is done by changing write response handler to not hold
mutation directly, but instead hold a container that, depending on
whether
this is read repair write or regular one, can provide different mutation
per destination.
Message-Id: <20160613124727.GL1096@scylladb.com>
query_state expects the current row limit to be updated so it
can be enforced across partition ranges. A regression introduced
in e4e8acc946 prevented that from
happening by passing a copy of the limit to querying_reader.
This patch fixes the issue by having column_family::query update
the limit as it processes partitions from the querying_reader.
Fixes#1338
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1465804012-30535-1-git-send-email-duarte@scylladb.com>
"Correctness of current uses of clear() and invalidate() relies on fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.
To provide such guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly like row_cache::update()
does.
Fixes #1291."
There are various call-sites that explicitly check for EEXIST and
ENOENT:
$ git grep "std::error_code(E"
database.cc: if (e.code() != std::error_code(EEXIST, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
Commit 961e80a ("Be more conservative when deciding when to shut down
due to disk errors") turned these errors into a storage_io_exception
that is not expected by the callers, which causes 'nodetool snapshot'
functionality to break, for example.
Whitelist the two error codes to revert back to the old behavior of
io_check().
Message-Id: <1465454446-17954-1-git-send-email-penberg@scylladb.com>
Make storage_io_exception exception error message less cryptic by
actually including the human-readable error message from
std::system_error...
Before:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage io error errno: 2
After:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage I/O error: 2: No such file or directory
We can improve this further by including the name of the file that the I/O
error happened on.
Message-Id: <1465452061-15474-1-git-send-email-penberg@scylladb.com>
Several shards may share the same sstable - e.g., when re-starting scylla
with a different number of shards, or when importing sstables from an
external source. Sharing an sstable is fine, but it can result in excessive
disk space use because the shared sstable cannot be deleted until all
the shards using it have finished compacting it. Normally, we have no idea
when the shards will decide to compact these sstables - e.g., with size-
tiered-compaction a large sstable will take a long time until we decide
to compact it. So what this patch does is to initiate compaction of the
shared sstables - on each shard using it - so that a soon as possible after
the restart, we will have the original sstable is split into separate
sstables per shard, and the original sstable can be deleted. If several
sstables are shared, we serialize this compaction process so that each
shard only rewrites one sstable at a time. Regular compactions may happen
in parallel, but they will not not be able to choose any of the shared
sstables because those are already marked as being compacted.
Commit 3f2286d0 increased the need for this patch, because since that
commit, if we don't delete the shared sstable, we also cannot delete
additional sstables which the different shards compacted with it. For one
scylla user, this resulted in so much excessive disk space use, that it
literally filled the whole disk.
After this patch commit 3f2286d0, or the discussion in issue #1318 on how
to improve it, is no longer necessary, because we will never compact a shared
sstable together with any other sstable - as explained above, the shared
sstables are marked as "being compacted" so the regular compactions will
avoid them.
Fixes#1314.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Previously, we were using a stat to decide if compaction should be
retried, but that's not efficient. The information is also lost
after node is restarted.
After these changes, compaction will be retried until strategy is
satisfied, i.e. there is nothing to compact.
We will now be doing the following in a loop:
Get compaction job from compaction strategy.
If cannot run, finish the loop.
Otherwise, compact this column family.
Go back to start of the loop.
By the way, pending_compactions stat will be deprecated after this
commit. Previously, it was increased to indicate the want for
compaction and decreased when compaction finished. Now, we can
compact more than we asked for, so it would be decreased below 0.
Also, it's the strategy that will tell the want for compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>
Sometimes a metric previously reported from collectd is not available
anymore. Previously, this caused scyllatop to log and exception to the
user - which in effect destroyes the user experience and inhibits
monitoring other metrics. This patch makes ScyllaTop handle this
problem. It will display such metrics and 'not available', and exclude
them from some and average computations.
Closes issue #1287.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1465301178-27544-1-git-send-email-yoav@scylladb.com>
From Asias:
In f27e5d2a6 (messaging_service: Delay listening ms during boot up),
messaging_service startup is splitted into two stages. Adjust the api
registration code and fix up the messaging_service stop code.
This patch makes a few minor improvements in the parser:
- merge first and rest into 2-argument form of Word to define
identifier – should give some performance boost, simpler code
- replace Literal(keyword_string) with Keyword(keyword_string)
throughout - stricter parsing, avoids misinterpreting identifiers
with keywords
- replace expr.setResultsName("name") with expr("name") throughout –
this is a style change (no actual change in underlying parser
behavior), but I find this form easier to follow
- add calls to setName to make exceptions more readable
Message-Id: <005901d1bbd2$711f7bb0$535e7310$@austin.rr.com>
There are two problems:
1. _server_tls is not stopped
2. _server and _server_tls might not be created if
messaging_service::start_listen is not called yet.
Since messaging_service is fully initialized in
storage_service::init_server which calls
messaging_service::start_listen, we need to delay
the messaging_service api registration after it.
The rate_moving_average is used by timed_rate_moving_average to return
its internal values.
If there are no timed event, the mean_rate is not propertly initilized.
To solve that the mean_rate is now initilized to 0 in the structure
definition.
Refs #1306
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1465231006-7081-1-git-send-email-amnon@scylladb.com>
This variable if set to true will activate
developer mode. It will be set by using the
-e option of docker run.
The xfs bind mount behavior and the cpuset behavior
will be set by using the relevant docker command
lines options and documented in the scylla/docker
howto.
Fixes: #1267
Signed-of-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1465213713-2537-1-git-send-email-benoit@scylladb.com>
Add a support for defining a probability (a value in a [0,1] range)
for tracing the next CQL request.
Traces for requests that are chosen to be traced due to this feature
are not going to flushed immediately.
Use std::subtract_with_carry_engine (implements the "lagged Fibonacci" algorithm)
random number engine for fastest generation of random integer values.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Correctness of current uses of clear() and invalidate() relies on fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.
To provide such guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly like row_cache::update()
does.
A tracing session life cycle includes 3 stages:
1) Active: when new trace records are being added to this session.
2) Pending for flushing to a storage: when session is over but not
yet flushed to the storage ("backend").
3) Flushing: when session's records are being flushed to the storage
and this process is not yet completed.
Sessions may accumulate in each of the stages above and we should limit
the maximum amount of sessions being accumulated in each of them in order to avoid OOM
situation.
Current in-tree implementation only limits the number of tracing sessions
accumulated in the first ("Active") stage.
Since currently every closing session is being immediately flushed (as long
as "settraceprobability" is not implemented) the second stage never accumulates
tracing sessions.
The third stage is currently not controlled at all and if, for instance, we
succeed to push enough tracing session towards a slow storage backend, they may
accumulate there consuming an uncontrolled amount of memory and may eventually consume
all of it.
This patch fixes this unpleasant situation by implying the following strategy:
- Limit the total amount of accumulated tracing sessions in all stages above together
by a static value - 2 times "flush threshold". "2 times" is needed to allow new
tracing sessions to accumulate in the stage 2 while sessions in the stage 3 are still
being processed.
- Forcefully flush sessions in the stage 2 to the storage when their count reaches a "flush
threshold".
This would ensure that there will not more than totally (2 * "flush threshold") sessions (in any stage)
on each shard.
An advantage of this strategy is its simplicity - we only need a single threshold to control all stages.
If we feel that we needed a finer graining for each stage we may add separate limits for each of them
in the future.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
* dist/ami/files/scylla-ami 72ae258...863cc45 (3):
> Move --cpuset/--smp parameter settings from scylla_sysconfig_setup to scylla_ami_setup
> convert scylla_install_ami to bash script
> 'sh -x -e' is not valid since all scripts converted to bash script, so remove them
Call for a tracing::tracing::create_session() doesn't promise a session creation.
Check that the session is actually created before trying to use it.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Currently we only shut down on EIO. Expand this to shut down on any
system_error.
This may cause us to shut down prematurely due to a transient error,
but this is better than not shutting down due to a permanent error
(such as ENOSPC or EPERM). We may whitelist certain errors in the future
to improve the behavior.
Fixes#1311.
Message-Id: <1465136956-1352-1-git-send-email-avi@scylladb.com>
It was discussed that leveled strategy may not benefit from parallel
compaction feature because almost all compaction jobs will have similar
size. It was also found that leveled strategy wasn't working correctly
with it because two overlapping sstable (targetting the same level)
could be created in parallel by two ongoing compaction.
Fixes#1293.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <60fe165d611c0283ca203c6d3aa2662ab091e363.1464883077.git.raphaelsc@scylladb.com>
From Duarte:
This patchset adds the range_tombstone_list data structure,
used to hold a set of disjoint range tombstones, and changes
the internal representation of row tombstones to use that
data structure.
Fixes#1155
[tgrabiec: Added compound_wrapper::make_empty(const schema&) overload
to fix compilation failure in tracing code]
This patch enables the RANGE_TOMBSTONES supported feature, meaning
that the node is capable of accepting row entry tombstones as range
tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since Scylla now supports proper range tombstones, the code for
reading ranges from sstables and converting them to overlapping
tombstones is no longer necessary, and is, in fact, wasteful as
the internal representation converts overlapping tombstones back to
ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the difference between two mutation_partition's
row_tombstones inside the range_tombstone_list.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.
Fixes#1155
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the range tombstones feature, which is not enabled
yet, to the storage_service, so that consumers can query for it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the gms::feature destructor so it
checks whether the gossiper has been stopped before trying
to unregister the feature.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the code from sstables/partition.cc which is used
to transform a set of range tombstones into a set of overlapping
scylladb tombstones.
The range_tombstone_merger will be used to send mutations to nodes not
yet updated to support the internal range tombstone representation.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This class is responsible for representing a set of range tombstones
as non-overlapping disjoint sets of range tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the range_tombstone class, composed of
a [start, end] pair of clustering_key_prefixes, the type
of inclusiveness of each bound, and a tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the idl-compiler so that the default value of a
field can be set to the value of a previous field in the class:
class P {
uint32_t x;
uint32_t y = x;
};
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
... and make it a clustering_key_prefix, in preparation of
supporting not-whole-row range tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Config provides operators << >> for string_map which makes it impossible
to have generic stream operators for unordered_map. Fix it by making
string_map a separate type and not just an alias.
Message-Id: <20160602102642.GJ9939@scylladb.com>
Limit disk bandwidth to 5MB/s to emulate a slow disk:
echo "8:0 5000000" >
/cgroup/blkio/limit/blkio.throttle.write_bps_device
echo "8:0 5000000" >
/cgroup/blkio/limit/blkio.throttle.read_bps_device
Start scylla node 1 with low memory:
scylla -c 1 -m 128M --auto-bootstrap false
Run c-s:
taskset -c 7 cassandra-stress write duration=5m cl=ONE -schema
'replication(factor=1)' -pop seq=1..100000 -rate threads=20
limit=2000/s -node 127.0.0.1
Start scylla node 2 with low memory:
scylla -c 1 -m 128M --auto-bootstrap true
Without this patch, I saw std::bad_alloc during streaming
ERROR 2016-06-01 14:31:00,196 [shard 0] storage_proxy - exception during
mutation write to 127.0.0.1: std::bad_alloc (std::bad_alloc)
...
ERROR 2016-06-01 14:31:10,172 [shard 0] database - failed to move
memtable to cache: std::bad_alloc (std::bad_alloc)
...
To fix:
1. Apply the streaming mutation limiter before we read the mutation into
memory to avoid wasting memory holding the mutation which we can not
send.
2. Reduce the parallelism of sending streaming mutations. Before we send each
range in parallel, after we send each range one by one.
before: nr_vnode * nr_shard * (send_info + cf.make_reader memory usage)
after: nr_shard * (send_info + cf.make_reader memory usage)
We can at least save memory usage by the factor of nr_vnode, 256 by
default.
In my setup, fix 1) alone is not enough, with both fix 1) and 2), I saw
no std::bad_alloc. Also, I did not see streaming bandwidth dropped due
to 2).
In addition, I tested grow_cluster_test.py:GrowClusterTest.test_grow_3_to_4,
as described:
https://github.com/scylladb/scylla/issues/1270#issuecomment-222585375
With this patch, I saw no std::bad_alloc any more.
Fixes: #1270
Message-Id: <7703cf7a9db40e53a87f0f7b5acbb03fff2daf43.1464785542.git.asias@scylladb.com>
"This series introduces a tracing infrastructure that may be used
for tracing CQL commands execution and measuring latencies of separate
stages of CQL handling as defined by a CQL binary protocol specification.
To begin tracing one should create a "tracing session", which may then
be used to issuing tracing events.
If execution of a specific CQL command involves other Nodes (not only a Coordinator),
then a "tracing session ID" is passed to that Node (in the context of the
corresponding RPC call). Then this "session ID" may be used to create a
"secondary tracing session" to issue tracing events in the context of the original session.
The series contains an implementation of tracing that uses a keyspace in the current
cluster for storing tracing information.
This series contains a demo per-request tracing instrumentation of a QUERY
CQL command and even this instrumentation is partial: it only fully instruments
a QUERY->SELECT->read_data call chain.
This is by all means a very beginning of the proper instrumentation which is
to come.
Right now the latencies for a single SELECT for a single raw with RF 1 from a 2 Nodes cluster
on my laptop started using ccm (for C* all default parameters, for scylla - memory 256MB, --smp 2)
are as follows (pseudo-graphics warning):
--------------------------------------------------------------------------------------------
| scylla (2 Nodes x 2 shards each) | C* 2.1.8
_______________________________________|___________________________________|________________
Coordinator and replica are same Node | |
(TRACING OFF): | 0.3ms | 0.3ms
c-s with a single thread mean latency | (was 0.2ms before the last |
value | rebase with a master) |
--------------------------------------------------------------------------------------------
Coordinator and replica are same Node | |
(TRACING ON) | ~250us | ~1200us
Running a SELECT command from a cqlsh | |
a few times | |
--------------------------------------------------------------------------------------------
Coordinator and replica are not on the | |
same Node | ~700us | >2500us
(TRACING ON) | |
--------------------------------------------------------------------------------------------
To begin tracing one may use a cqlsh "TRACING ON/OFF" commands:
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> select "C0", "C1" from keyspace1.standard1 where key=0x12345679;
C0 | C1
--------------------+------
0x000000000001e240 | null
(1 rows)
Tracing session: 146f0180-21e7-11e6-b244-000000000000
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------+----------------------------+-----------+----------------
select "C0", "C1" from keyspace1.standard1 where key=0x12345679; | 2016-05-24 22:38:24.536000 | 127.0.0.1 | 0
message received from /127.0.0.1 [0] | 2016-05-24 22:38:24.537000 | 127.0.0.2 | --
Done reading options [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 3
read_data handling is done [0] | 2016-05-24 22:38:24.537000 | 127.0.0.2 | 37
Parsing a statement [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 3
Processing a statement [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 56
Done processing - preparing a result [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 550
Request complete | 2016-05-24 22:38:24.536560 | 127.0.0.1 | 560
cqlsh>"
This is a demo instrumentation:
- Check if a tracing info is present in the read_command.
- If yes - create a tracing session with the given tracing
session ID.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Instrument a coordinator of a SELECT query to send tracing session
info to the corresponding replica Nodes.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Store a trace state inside a client_state.
- Start tracing in a cql_server::connection::process_query().
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Add a tracing ID (UUID) optional field to cql_server::response.
- If _tracing_id is set make_frame() would insert a tracing ID
in the response message. According to CQL spec it should be the
first thing in the response "body" and the TRACING bit (0x02) should be
set in the "flags" field.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When client_state is created with an external_tag - store
a client address in the client state.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
trace_state: Is a single tracing session.
tracing: A sharded service that contains an i_trace_backend_helper instance
and is a "factory" of trace_state objects.
trace_state main interface functions are:
- begin(): Start time counting (should be used via tracing::begin() wrapper).
- trace(): Create a tracing event - it's coupled with a time passed since begin()
(should be used via tracing::trace() wrapper).
- ~trace_state(): Destructor will close the tracing session.
"tracing" service main interface function is:
- start(): Initialize a backend.
- stop(): Shut down a backend.
- create_session(): Creates a new tracing session.
(tracing::end_session(): Is called by a trace_state destructor).
When trace_state needs to store a tracing event it uses a backend helper from
a "tracing" service.
A "tracing" service limits a number of opened tracing session by a static number.
If this number is reached - next sessions will be dropped.
trace_state implements a similar strategy in regard to tracing events per singe
session.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Uses a CQL keyspace system_traces to store tracing information.
Uses two tables:
CREATE TABLE system_traces.sessions (
session_id uuid,
command text,
client inet,
coordinator inet,
duration int,
parameters map<text, text>,
request text,
started_at timestamp,
PRIMARY KEY ((session_id)))
and
CREATE TABLE system_traces.events (
session_id uuid,
event_id timeuuid,
activity text,
source inet,
source_elapsed int,
thread text,
PRIMARY KEY ((session_id), event_id))
system_traces.sessions table contains records of tracing sessions.
system_traces.sessions columns description:
- session_id: an ID of the session.
- command: type of a command this session was created for
(currently supported "NONE", "QUERY" and "REPAIR").
- client: IP of the client that issued the command.
- coordinator: IP of a coordinator that received the command.
- duration: total duration of the tracing session (in us).
- parameters: optional parameters for this session, passed to
i_trace_state::begin() call.
- request: a CQL command this tracing session is created for.
- started_at: the time the session has been started at.
system_traces.events contains records of separate tracing events.
system_traces.events columns description:
- session_id: an ID of the session.
- event_id: an ID of the event.
- activity: the trace point description - a message given to
i_trace_state::trace().
- source: IP of the Node where trace event was issued.
- source_elapsed: time passed since creation of a tracing session (in us) on
the Node where this trace event was issued.
- thread: name of the thread in who's context this trace event was
issued in (currently its "core N", where 'N' is an index of
a shard the trace event was issued on).
This class will cache lambdas creating the corresponding mutations for each tracing
record requested to be stored till flush() method is called.
flush() will merge all pending mutations to "sessions" and "events" tables and
then apply a mutation to "events" table and when it completes - to "sessions"
table. This way it'll ensure that when some tracing session is visible, all its
events are visible too.
trace_keyspace_helper exposes a few metrics via collectd:
- tracing_error - a total number of errors (not including OOM)
- bad_column_family_errors - number of times a tracing record wasn't
stored because system_trace tables' schema
didn't match the expected value. This may happen if
a DB administrator is doing funny things like altering
the schemas of the above tables.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This class represents an interface for a specific backend that is
going to store tracing information.
The specific implementation may and expected to implement caching
of pending tracing records.
Interface functions are:
- start(): Initialize a backend (e.g. create keyspace and tables).
- stop(): Flush all pending work and shut down the backend.
- store_session_record()/store_event_record():
Cache/store the corresponding tracing records.
- flush(): Flush pending tracing records.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
writes_attempts suppose to count how many time data was sent out, but
currently it counts even those replicas in other DCs that get the data
through a coordinator. Fix it by counting only when data is actually sent.
Message-Id: <20160601153124.GB9939@scylladb.com>
"One of the things we need to do as part of the throttle rework I am doing is to
serialize memtable flushes to some extent - that will guarantee that in case
we're throttling, the flushes finish earlier and release memory earlier, if
compared to the case in which we just let all tables flush freely and
simultaneously."
* seastar 0bcdd28...864d6dc (4):
> Logging framework
> Add libubsan and libasan to fedora deps docs
> tests: add rpc cancellable tests
> rpc: add cancellable interface
Dropped logging implementation in favor of seastar's due to a link
conflict with operator<<.
This series adds a constructor to malformed_sstable_exception that
includes a filename and converts some call-sites to use it.
There are still plenty of low-level sites that don't even know the
sstable filename they are operating on. We need to either change the
code to carry the filename to lower layers or find a higher-level
call-site where we can catch malformed_sstable_exception and rethrow it
with the sstable filename. But that's for another series by someone who
knows the sstable code well.
Refs #669.
This reverts commit b3ed55be1d.
The issue is in the failing dtest, not this commit. Gleb writes:
"The bug is in the test, not the patch. Test waits for repair session
to end one way or the other when node is killed, but for nodetool to
know if repair is completed it needs to poll for it. If node dies
before nodetool managed to see repair completion it will stuck
forever since jmx is alive, but does not provide answers any more.
The patch changes timing, repair is completed much close to exit now,
so problem appears, but it may happen even without the patch.
The fix is for dtest to kill jmx as part of killing a node
operation."
Now that Lucas fixed the problem in scylla-ccm, revert the revert.
We can only free memory for a region_group when the entire memtable is released.
This means that while the disk can handle requests from multiple memtables just fine,
we won't free any memory until all of them finish. If we are under a pressure situation
we will take a lot more time to leave it.
Ideally, with write-behind, we would allow just one memtable to be flushed at a
time. But since we don't have it enabled, it's better to serialize the flushes
so that only some memtables (4) are flushed at a time. Having the memtable writer
bandwidth all to itself, the memtable will finish sooner, release memory sooner,
and recover the system's health sooner.
We would like to do that without having streaming and memtables starve each
other. Ideally, that should mean half the bandwidth for each - but that
sacrifices memtable writes in the common case there is no streaming. Again,
write behind will help here, and since this is something we intend to do, there
is no need to complicate the code too much for an interim solution.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch introduces an explicit behavior enum class - one of delayed or
immediate, that allow callers to tell the memtable list whether they want a
delayed flush (default), or force an immediate flush. So far this only affects
the streaming code (memtables just ignore it), but the concept is one that can
be easily generalized.
With that in place, we can revert back the stop function to use the standard
flush. I have argued before that adding infrastructure like that would not be
worth it for the sake of stop alone, but some other code could now use it.
Specifically, the active reclaimer for the throttler would like to force
immediate flushes, as delayed flushes really won't make a lot of difference in
reducing memory usage.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When a node starts up, peer node can send gossip syn message to it
before the gossip message handlers are registered in messaging_service.
We can see:
scylla[123]: [shard 0] rpc - client a.b.c.d: unknown verb exception 6 ignored
To fix, we delay the listening of messaging_service to the point when
gossip message handlers are registered.
Message-Id: <9b20d85e199ef0e44cdcde2920123a301a88f3d7.1464254400.git.asias@scylladb.com>
Metadata usually doesn't change after it is created; make that visible in
the code, allowing further optimizations to be applied later.
Message-Id: <1464334638-7971-3-git-send-email-avi@scylladb.com>
Rather than dynamic_cast<>ing the statement to see whether it is a
select statement, add a virtual function to cql_statement to get the
result metadata.
This is faster and easier to follow.
Message-Id: <1464334638-7971-2-git-send-email-avi@scylladb.com>
Scylla-jmx and collectd can preempt scylla and induce long latencies. Tune
the scheduler to provide lower latencies.
Since when the support processes are not running we normally do not context
switch (one thread per core, remember?), there should be no effect on
throughput.
The tunings are provided in a separate package, which can be uninstalled
if the server is shared with other applications which are negatively
affected by the tuning.
Fixes#1218.
Message-Id: <1464529625-12825-1-git-send-email-avi@scylladb.com>
compact_on_idle will lead users to thinking we're talking about sstable
compaction, not log-structured-allocator compaction.
Rename the variable to reduce the probability of confusion.
Message-Id: <1464261650-14136-1-git-send-email-avi@scylladb.com>
When read/write to a partition happens in parallel reader may detect
digest mismatch that may potentially cause cross DC read repair attempt,
but the repair is not really needed, so added latency is not justified.
This patch tries to prevent such parallel access from causing heavy
cross DC repair operation buy checking a timestamp of most resent
modification. If the modification happens less then "write timeout"
seconds ago the patch assumes that the read operation raced with write
one and cancel cross DC repair, but only if CL is LOCAL_*.
The space calculation counters in column family had two problem:
1. The total bytes is an ever growing counter, which is meaningless for
the API.
2. Trying to simply sum the size on all shards, ignores the fact that the
same sstable file can be referenced by multiple shards, this is
especially noticeable during migration time.
To solve this, the implementation was modified so instead of
collecting the sizes, the API would collect a map of file name to size
and then would do the summing.
This removes the duplications and fixes the total bytes calculation
Calling cfstats before the change with load after a compaction happend:
$ nodetool cfstats keyspace1
Keyspace: keyspace1
Verify write latency 1068253.0 76435
Read Count: 75915
Read Latency: 0.5953986037015082 ms.
Write Count: 76435
Write Latency: 0.013975966507490025 ms.
Pending Flushes: 0
Table: standard1
SSTable count: 5
Space used (live): 44261215
Space used (total): 219724478
After the fix:
$ nodetool cfstats keyspace1
Keyspace: keyspace1
Verify write latency 1863206.0 124219
Read Count: 125401
Read Latency: 0.9381053978835895 ms.
Write Count: 124219
Write Latency: 0.01499936402643718 ms.
Pending Flushes: 0
Table: standard1
SSTable count: 6
Space used (live): 50402904
Space used (total): 50402904
Space used by snapshots (total): 0
Fixes: #1042
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1464518757-14666-2-git-send-email-amnon@scylladb.com>
We have recently commited a fix to a broken streaming bug that involved
reverting column_family::stop() back to calling the custom seal functions
explicitly for both memtables and streaming memtables.
We here add a comment to explain why that had to be done.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fe94b5883e9c29adc7fc9ee9f498894c057e7b64.1464293167.git.glauber@scylladb.com>
"This patch changes the way we wait for supported features. We no longer
sleep periodically, waking up to check if the wanted features are now
avaiable. Instead, we register waiters in a condition variable that is
signaled whenever new endpoint information is received.
We also add a new poll interface based on the feature class, which
encapsulates the availability of a cluster feature."
This class encapsulates the waiting for a cluster feature. A feature
object is registered with the gossiper, which is responsible for later
marking it as enabled.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the sleep-based mechanism of detecting new features
by instead registering waiters with a condition variable that is
signaled whenever a new endpoint information is received.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch removes the timeout when waiting for features,
since future patches will make this argument unnecessary.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch fixes an inadvertent change to the shadow endpoint state
map in gossiper::run, done by calling get_heart_beat_state() which
also updates the endpoint state's timestamp. This did not happen for
the normal map, but did happen for the shadow map. As a result, every
time gossiper::run() was scheduled, endpoint_map_changed would always
be true and all the shards would make superfluous copies of the
endpoint state maps.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1464309023-3254-2-git-send-email-duarte@scylladb.com>
"This patchset provides a way to enable SET_NIC(posix_net_conf.sh) on
non-AMI environment.
Also support -mq option of the script.
This also contains number of bug fixes of scripts.
Fixes#1192"
NOTE: scyllatop now requires the urwid library
previously, if there were more metrics that lines in the terminal
window, the user could not see some of the metrics. Now the user can
scroll.
As an added bonus, the program will not crash when the window size
changes.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1464098832-5755-1-git-send-email-yoav@scylladb.com>
Vlad reported a strange user configuration:
SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level
info --collectd-address=127.0.0.1:25826 --collectd=1
--collectd-poll-period 60000 --network-stack posix --num-io-queues 32
--max-io-requests 128 --replace-address 10.0.4.131"
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "10.0.4.131"
In the mean while, 10.0.4.131 is the IP address of the node itself.
When the node was started, the following message were reported.
Apr 13 06:31:12 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (20 seconds passed)
Apr 13 06:31:13 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (21 seconds passed)
Apr 13 06:31:14 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (22 seconds passed)
Apr 13 06:31:15 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (23 seconds passed)
The configruation is invalid, becasue for --replace-address to
work, at least one working seed node should be alive. Catch the
configuration error and fail it with an appropriate error message.
Fixes#1183
Message-Id: <a94a082d896313e7a668915ae21fe2c03719da3a.1464164058.git.asias@scylladb.com>
_live_endpoints_just_added tracks the peer node which just becomes live.
When a down node gets back, the peer nodes can receive multiple messages
which would mark the node up, e.g., the message piled up in the sender's
tcp stack, after a node was blocked with gdb and released. Each such
message will trigger a echo message and when the reply of the echo
message is received (real_mark_alive), the same node will be added to
_live_endpoints_just_added.push_back more than once. Thus, we see the
same node be favored more than once:
INFO 2016-04-12 12:09:57,399 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:09:58,412 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:09:59,429 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:00,429 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:01,430 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:02,442 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:03,454 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
To fix, do not insert the node if it is already in
_live_endpoints_just_added.
Fixes#1178
Message-Id: <6bcfad4430fbc63b4a8c40ec86a2744bdfafb40f.1464161975.git.asias@scylladb.com>
In commit 4981362f57, I have introduced a regression that was thankfully
caught by our dtest infrastructure.
That patch is a preparation patch for the active reclaim patchset that is to
come, and it consolidated all the flushes using the memtable_list's seal_fn
function instead of calling the seal function explicitly.
The problem here is that the streaming memtables have the delayed mechanism,
about which the memtable_list is unaware. Calling memtable_list's
seal_active_memtable() for the streaming memtables calls the delayed version,
that does not guarantee flush. If we're lucky, we will indeed flush after the
timer expires, but if we're not we'll just stop the CF with data not flushed.
There are two options to fix this: the first is to teach the memtable_list about
the delayed/forced mechanism, and the second is to just call the correct
function explicitly during shutdown, and then when the time comes to add
continuations to the result of the seal, add them here as well.
Although the second option involves a bit more work and duplication, I think it
is better in the sense that the delayed / forced mechanism really is something
that belong to the streaming only. Being this the only user, I don't think it
justifies complicating the memtable_list with this concept.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b26017c825ccf585f39f58c4ab3787d78e551f5f.1464126884.git.glauber@scylladb.com>
"This change is intended to make migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to upload
directory of respective column families, and run 'nodetool refresh'.
That's supposed to be the preferred option from now on."
The default CQL frame compression algorithm in Cassandra is LZ4. Add
support for decompressing incoming frames and compressing outgoing
frames with LZ4 if the CQL driver asks for that.
Fixes#416
Message-Id: <1464086807-11325-1-git-send-email-penberg@scylladb.com>
* seastar 6a849ac...aed893e (3):
> net: move 'transport' enum to seastar namespace
> net: sctp protocol support for posix stack
> future: Support get() when state is at a promise
This patch solve a problem where a complex type is define as version
depended (with the version attribute) but doesn't have a default value.
In those cases the default constructor is used, but in the case of
complex types (template) param_type should be use to get the C++ type.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1463916723-15322-1-git-send-email-amnon@scylladb.com>
This change is intended to make migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to upload
directory of respective column families, and call 'nodetool refresh'.
That's supposed to be the preferred option from now on.
For each sstable in upload directory, refresh will do the following:
1) Mutate sstable level to 0.
2) Create hard links to its components in column family dir, using
a new generation. We make it safe by creating a hard link to temporary
TOC first.
3) Remove all of its components in upload directory.
This new code runs after refresh checked for new sstables in the column
family directory. Otherwise, we could have a generation conflict.
Unlike the first step, this new step runs with sstable write enabled.
It's easier here because we know exactly which sstables are new.
After that, refresh will load new sstables found in column family
and upload directories.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It's not working because it tries to overwrite existing statistics
file with exclusive flag.
It's fixed by writing new statistics into temporary file and
renaming it into place.
If Scylla failed in middle of rewrite, a temporary file is left
over. So boot code was adjusted to delete a temporary file created
by this rewrite procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, we register snitch API in set_server_gossip_settle() which
waits until a node has joined the cluster. This makes 'nodetool status'
not properly show the status of a joining node. Fix the issue by
registering snitch API earlier.
Fixes#1269.
Message-Id: <1463576381-15484-1-git-send-email-penberg@scylladb.com>
Since we added scylla-conf package, we cannot install scylla-server/-tools without the package, because of this --localrpm is failing.
So copy scylla-conf package to AMI, and install it to fix the problem.
These parameters are only required for AMI, not for non-AMI environment which want to enable SET_NIC, so split them to indivisual script / conf file, call it from AMI install script.
In a preparation move for the LSA throttler, we have reordered the
initialization fields in database.hh so that the sizes of the regions are
computed before the initialization of the region.
However, that seemingly innocent move broke one of our tests. The reason behind
that, is that if we don't destroy the column families before destroying the
region, we may end up with a use after free in the memtable destructor - that
itself expects to call into the region.
This patch reorders the initialization so that the CF list still comes after the
dirty regions (therefore being destroyed first), while maintaining the relative
ordering between size / region that we needed in the first place.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0669984b5bccdb2c950f2444bdee4427abad56ba.1463508884.git.glauber@scylladb.com>
In perf-flame, I saw in
service::storage_proxy::create_write_response_handler (2.66% cpu)
gossiper::is_alive takes 0.72% cpu
locator::token_metadata::pending_endpoints_for takes 1.2% cpu
After this patch:
service::storage_proxy::create_write_response_handler (2.17% cpu)
gossiper::is_alive does not show up at all
locator::token_metadata::pending_endpoints_for takes 1.3% cpu
There is no need to copy the endpoint_state from the endpoint_state_map
to check if a node is alive. Optimize it since gossiper::is_alive is
called in the fast path.
Message-Id: <2144310aef8d170cab34a2c96cb67cabca761ca8.1463540290.git.asias@scylladb.com>
Refresh will rewrite statistics of any migrated sstable with level
> 0. However, this operation is currently not working because O_EXCL
flag is used, meaning that create will fail.
It turns out that we don't actually need to change on-disk level of
a sstable by overwriting statistics file.
We can only set in-memory level of a sstable to 0. If Scylla reboots
before all migrated sstables are compacted, leveled strategy is smart
enough to detect sstables that overlap, and set their in-memory level
to 0.
Fixes#1124.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
pending_endpoints_for is called frequently by
storage_proxy::create_write_response_handler when doing cql query.
Before this patch, each call to pending_endpoints_for involves
converting a multimap (std::unordered_multimap<range<token>,
inet_address>>) to map (std::unordered_map<range<token>,
std::unordered_set<inet_address>>).
To speed up the token to pending endpoint mapping search, a interval map
is introduced. It is faster than searching the map linearly and can
avoid caching the token/pending endpoint mapping.
With this patch, the operations per second drop during adding node
period gets much better.
Before:
45K to 10K
After:
45k to 38K
(The number is measured with the streaming code skipping to send data to
rule out the streaming factor.)
Refs: #1223
object
The API would expose now the rate_moving_average and
rate_moving_average_and_histogram.
The old end points remains for the transition period, but marked as
depricated.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch replaces the latency histogram to
rate_moving_avrage_and_histogram and the counters to
rate_moving_average.
The old endpoints where left unchagned but marked as depricated when
needed.
This patch replaces the helper function for column family with two
function, one that collect the relevant column family from all shareds
and another one that do the translation to json object.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
timed_rate_moving_average_and_histogram
As part of moving the derived statistic in to scylla, this replaces the
histogram object in the column_family to
timed_rate_moving_average_and_histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
As part of moving the derived statistic in to scylla, this replaces the
counter in the row_cache stats to
timed_rate_moving_average_and_histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
timed_rate_moving_average_and_histogram
As part of moving the derived statistic in to scylla, this replaces the
histogram object in the storage_proxy to
timed_rate_moving_average_and_histogram. and the read, write and range
counters where replaced by rate_moving_average.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the helper function that are used to sum the
rate_moving_average and rate_moving_average_and_histogram.
The current sum functionality for histogram was modified to support
rate and histogram but return a histogram. This way current endpoints
would continue to behave the same.
It also cleans the histogram related method by using the plus operator
in the histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a few data structure for derived and accumulative
statistics that are similiar to the yammer implementation used by the
JMX.
It also adds a plus operator to histogram which cleans the histogram
usage.
moving_average - An exponentially-weighted moving average. calculate an event rate
on a given interval.
rate_moving_average and timed_rate_moving_average - Calculate 1m, 5m and
15m ewma an all time avrage and a counter.
rate_moving_average_and_histogram and
timed_rate_moving_average_and_histogram - Combines a histogram with a
rate_moving_average. It also expose a histogram API so it will be an
easy task to replace a histogram with a
timed_rate_moving_average_and_histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We've been keeping two constructors for the column family to allow for a
version without the commitlog. But it's by now quite complicated to maintain
the two, because changes always have to be made in two places.
This patch adds a private constructor that does the actual construction, and
have the public constructors to call it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <dd3cb0b9c20ad154a6131bad6ece619f70ed5025.1463448522.git.glauber@scylladb.com>
I would like to be able to apply a function at the end of every flush, that is
common for both memtables and streaming memtables. For instance, to unthrottle
current waiters. Right now some calls to seal_active_memtable are open coded,
calling the column family's function directly, for both the main memtable list
and the streaming list.
This patch moves all the current open code callers to call the respective
memtable_list function.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0c780254f3c4eb03e2bcd856b83941cf49a84b85.1463448522.git.glauber@scylladb.com>
As Nadav pointed out, SETENV and sudo -E might be causes security hole:
https://github.com/scylladb/scylla/issues/1028#issuecomment-196202171
So drop them now, sourcing envfiles from scylla_prepare / scylla_stop scripts
instead.
Also on "[PATCH] ubuntu: Fix the init script variable sourcing" thread
we have problem to passing variables from envfiles to scylla_prepare /
scylla_stop on Ubuntu, it seems better to sourcing from these scripts.
Additionally, this fixes#1249
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462989906-30062-1-git-send-email-syuu@scylladb.com>
"The Prepared message has a metadata section that's similar to result set
metadata but not exactly the same. Fix serialization by introducing a
separate prepared_metadata class like Origin has and implement
serialization as per the CQL protocol specification. This fixes one CQL
binary protocol version 4 issue that we currently have.
The changes have been verified by running the gocql integration tests
using v4. Please note that this series does *not* enable v4 for clients
because Cassandra 2.1.x series only supports CQL binary protocol v3."
Introduce a new prepared_metadata class that holds prepared statement
metadata and implement CQL binary protocol serialization that works for
all versions.
From Piotr:
Fixes#656.
It makes it possible to slice using clustering ranges in mutation
readers. We don't have row index yet so the slicing is just ignoring
data which is out of range.
Add additional parameters to mp_row_consumer to be able to fetch
only cells for given clustering key ranges
This will be used in row_cache when it will work on clustering key
level instead of partition key level.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
rate_moving_average and rate_moving_average_and_histogram are type that
are used by the JMX. They are based on the yammer meter and timer and
are used to collect derivative information.
Specificlly: rate_moving_average calculate rates and
rate_moving_average_and_histogram collect rates and
histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When running with DEBUG verbosity, scyllatop will now log every single
value it receives from collectd. When you suspect that scyllatop is
somehow distorting values, this is a good way to check it.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1463320730-6631-1-git-send-email-yoav@scylladb.com>
"Writes may start to be rejected by replicas after issuing alter table
which doesn't affect columns. This affects all versions with alter table
support.
Fixes#1258"
Currently we only do that when column set changes. When prepared
statements are executed, paramaters like read repair chance are read
from schema version stored in the statement. Not invalidating prepared
statements on changes of such parameters will appear as if alter took
no effect.
Fixes#1255.
Message-Id: <1462985495-9767-1-git-send-email-tgrabiec@scylladb.com>
Spotted during code review.
If it doesn't defer, we may execute then_wrapped() body before we
change the state. Fix by moving then_wrapped() body after state changes.
The problem was that "s" would not be marked as synced-with if it came from
shard != 0.
As a result, mutation using that schema would fail to apply with an exception:
"attempted to mutate using not synced schema of ..."
The problem could surface when altering schema without changing
columns and restarting one of the nodes so that it forgets past
versions.
Fixes#1258.
Will be covered by dtest:
SchemaManagementTest.test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns
Stop using /var/lib/scylla, use $SCYLLA_HOME instead.
systemd seems does not extract variables on Environment="HOME=$SCYLLA_HOME", but both CentOS/Ubuntu able to run scylla-server without $HOME, so dropped it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462977871-26632-1-git-send-email-syuu@scylladb.com>
ALTER KEYSPACE should allow no replication strategy to be set,
in which case old strategy should be kept.
Initial translation from origin missed this.
Fixes#1256
Message-Id: <1462967584-2875-2-git-send-email-calle@scylladb.com>
Currently scylla_io_setup hardcoded to run iotune on /var/lib/scylla, but user may change data directory by modifying scylla.yaml, and it may on different block device.
So use scylla_config_get.py to get configuration from scylla.yaml, passes it to iotune.
Fixes#1167
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462955824-21983-2-git-send-email-syuu@scylladb.com>
To parse scylla.yaml, scylla_config_get.py is added.
It can be use like 'scylla_config_get.py [key name]' from shell script, or command line.
This is needed for scylla_io_setup, to get 'data_file_directories' from shellscript.
Currently it does not supported to specify key name of nested data structure, but enough for scyll_io_setup.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462955824-21983-1-git-send-email-syuu@scylladb.com>
Since Ubuntu 14.04LTS needs scylla-gdb package which install to /opt/scylladb, we need to port scylla-env package to Ubuntu as well.
This change introduces scylla-env package to Ubuntu 14.04LTS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462825880-20866-2-git-send-email-syuu@scylladb.com>
Since Ubuntu 14.04LTS needs scylla-gdb package which install to /opt/scylladb, we need to port scylla-env package to Ubuntu as well.
To do it, share the package directory on dist/common/dep at first.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462825880-20866-1-git-send-email-syuu@scylladb.com>
Reloads keyspace metadata and replaces in existing keyspace.
Note: since keyspace metadata, and consequently, replication
strategy now becomes volatile, keyspace::metadata now returns
shared pointer by value (i.e. keep-alive).
Replication strategy should receive the same treatment, but
since it is extensively used, but never kept across a
continuation, I've just added a comment for now.
1.) It most likely is not, i.e. either tcp or more likely, ssl
negotiation failure. In any case, we can still try next
connection.
2.) Not retrying will cause us to "leak" the accept, and then hang
on shutdown.
Also, promote logging message on accept exception to "warn", since
dtest(s?) depend on seeing log output.
Message-Id: <1462283265-27051-4-git-send-email-calle@scylladb.com>
To simplify init of msg service, use credendials_builder
to encapsulate tls options so actual credentials can be
more easily created in each shard.
Message-Id: <1462283265-27051-2-git-send-email-calle@scylladb.com>
* seastar 7782ad4...3dec26f (3):
> tests/mkcert.gmk: Fix makefile bug in snakeoil cert generator
> tls_test: Add case to do a little checking of credentials_builder
> tls: Add credentials_builder - copyable credentials "factory"
From Avi:
When we shut down, we may have to give up on some pending atomic
sstable deletions, because not all shards may have agreed to delete
all members of the set.
This is expected, so silence these frightening error messages.
Fixes#1235.
This patch adds support for secure connection attempts to be
cancellable.
Fixes#862
Includes seastar upstream merge:
* seastar f1a3520...7782ad4 (1):
> Merge "rpc: Allow client connections to be cancelled" from Duarte
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1462783335-10731-1-git-send-email-duarte@scylladb.com>
Avi says:
"During shutdown, we prevent new compactions, but perhaps too late.
Memtables are flushed and these can trigger compaction."
To solve that, let's stop compaction manager at a very early step
of shutdown. We will still try to stop compaction manager in
database::stop() because user may ask for a shutdown before scylla
was fully started. It's fine to stop compaction manager twice.
Only the first call will actually stop the manager.
Fixes#1238.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <c64ab11f3c91129c424259d317e48abc5bde6ff3.1462496694.git.raphaelsc@scylladb.com>
* seastar e536555...ab74536 (4):
> reactor: kill max_inline_continuations
> smp: optimize smp_message_queue::flush_request_batch() for empty queue
> thread: do not yield if idle
> Merge "Fixes for iotune" from Glauber
Clustering key prefix may have less columns than described in schema.
Deserailiaztion should stop when end of buffer is reached.
Message-Id: <20160503140420.GP23113@scylladb.com>
It was noticed that small sstables will accumulate for a column family because
scylla was limited to two compaction per shard, and a column family could have
at most one compaction running at a given shard. With the number of sstables
increasing rapidly, read performance is degraded.
At the moment, our compaction manager works by running two compaction task
handlers that run in parallel to the rest of the system. Each task handler
gets to run when needed, gets a column family from compaction manager queue,
runs compaction on it, and goes to sleep again. That's basically its cycle.
Compaction manager only allows one instance of a column family to be on its
queue, meaning that it's impossible for a column family to be compacted in
parallel. One compaction starts after another for a given column family.
To solve the problem described, we want to concurrently run compaction jobs
of a column family that have different "size tier" (or "weight").
For those unfamiliar, compaction job contains a list of sstables that will be
compacted together.
The "size tier" of a compaction job is the log of the total size of the input
sstables. So a compaction job only gets to run if its "size tier" is not the
same of an ongoing compaction. There is no point in compacting concurrently at
the same "size tier", because that slows down both compactions.
We will no longer queue column families in compaction manager. Instead, we
create a new fiber to run compaction on demand.
This fiber that runs asynchronously will do the following:
1) Get a compaction job from compaction strategy.
2) Calculate "size tier" of compaction job.
3) Run compaction job if its "size tier" is not the same of an ongoing
compaction for the given column family.
As before, it may decide to re-compact a column family based on a stat stored
in column family object.
Ran all compaction-related dtests.
Fixes#1216.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>
In initial implementation I figured this was not required, but
we get issues communicating across nodes if system tables
don't have the same UUID, since creation is forcefully local, yet
shared.
Just do a manual re-create of the scema with a name UUID, and
use migration manager directly.
Message-Id: <1462194588-11964-1-git-send-email-calle@scylladb.com>
The patch calculates row count during result building and while merging.
If one of results that are being merged does not have row count the
merged result will not have one either.
Fixes: #1220
While the server_credentials object is technically immutable
(esp with last change in seastar), the ::shared_ptr holding them
is not safe to share across shards.
Pre-create cpu x credentials and then move-hand them out in service
start-up instead.
Fixes assertion error in debug builds. And just maybe real memory
corruption in release.
Requires seastar tls change:
"Change server_credentials to copy dh_params input"
Message-Id: <1462187704-2056-1-git-send-email-calle@scylladb.com>
Leveled compaction strategy is doing a lot of work whenever it's asked to get
a list of sstables to be compacted. It's checking if a sstable overlaps with
another sstable in the same level twice. First, when adding a sstable to a
list with sstables at the same level. Second, after adding all sstables to
their respective lists.
It's enough to check that a sstable creates an overlap in its level only once.
So I am changing the code to unconditionally insert a sstable to its respective
list, and after that, it will call repair_overlapping_sstables() that will send
any sstable that creates an overlap in its level to L0 list.
By the way, the optimization isn't in the compaction itself, instead in the
strategy code that gets a set of sstables to be compacted.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <8c8526737277cb47987a3a5dbd5ff3bb81a6d038.1461965074.git.raphaelsc@scylladb.com>
Currently scylla_setup is unusable when user does not want to install scylla-jmx because it checks package unconditionally, but some users (or developers) does not want to install it, so let's ask to skip check or not on interactive prompt.
Also, scylla-tools package should installed for most of the case, added check code for the package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1460662354-10221-1-git-send-email-syuu@scylladb.com>
On Ubuntu 15.04 and newer, official g++ package is >= g++-4.9.
So we don't need to use development repository, just use official package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
To handle scylla startup correctly on systemd versions of Ubuntu, scylla requires to build with libsystemd-dev.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Since 16.04LTS does not support this argument anymore, drop it on recent version of Ubuntu which does not uses Upstart.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Use distribution's thrift if version > 14.04LTS.
14.04LTS doesn't have thrift-compiler-0.9.1, use our version.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Amnon reports that current code fails to compile on gcc 4.9:
distcc[9700] ERROR: compile /home/amnon/.ccache/tmp/query.tmp.localhost.localdomain.9673.ii on localhost failed
In file included from query.cc:30:0:
query-result-reader.hh: In instantiation of ‘void query::result_view::consume(const query::partition_slice&, ResultVisitor&&) [with ResultVisitor = query::result::calculate_row_count(const query::partition_slice&)::<anonymous struct>&]’:
query.cc:196:32: required from here
query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class clustering_key_prefix’ through ‘...’
visitor.accept_new_row(*row.key(), static_row, view);
^
query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
query-result-reader.hh:186:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
visitor.accept_new_row(static_row, view);
^
query-result-reader.hh:186:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
Work around the problem by not using '...'.
Message-Id: <1460964042-2867-1-git-send-email-tgrabiec@scylladb.com>
Drop dependency to libthrift0 on installation time, link libthrift statically.
With this fix, we don't need to distribute libthrift0 deb package anymore to install scylla-server binary package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1461594460-2403-2-git-send-email-syuu@scylladb.com>
* seastar 2b3c363...15a92cf (2):
> smp: allow more than 128 in-flight operations on core-to-core queue
> future: balance constructors and destructors in future_state<>
Fixes#1205.
"has_keyspace_access" is not supposed to (according to origin)
verify that a keyspace exists. Remove.
It (and all others) are however supposed to check "ks" (name)
not empty. Add this.
Message-Id: <1461578072-24113-1-git-send-email-calle@scylladb.com>
Add the statistics counter for a number of unprepared statements
executions and expose it with collectd.
Since in our implementation a number of unprepared statements executions
equals to a number of executions of prepare() function we may simply
increment the new statistics counter every time query_processor::get_statement()
is called.
Fixes#1068
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1461503492-32228-1-git-send-email-vladz@cloudius-systems.com>
"This series introduces some additional metrics (mostly) in a storage_proxy and
a database level that are meant to create a better picture of how data flows
in the cluster.
First of all where possible counters of each category (e.g. total writes in the storage
proxy level) are split into the following categories:
- operations performed on a local Node
- operations performed on remote Nodes aggregated per DC
In a storage_proxy level there are the following metrics that have this "split"
nature (all on a sending side):
- total writes (attempts/errors)
- writes performed as a result of a Read Repair logic
- total data reads (attempts/completed/errors)
- total digest reads (attempts/completed/errors)
- total mutations data reads (attempts/completed/errors)
In a batchlog_manager:
- writes performed as a result of a batchlog replay logic
Thereby if for instance somebody wants to get an idea of how many writes
the current Node performs due to user requested mutations only he/she has
to take a counter of total writes and subtract the writes resulted by Read
Repairs and batchlog replays.
On a receiving side of a storage_proxy we add the two following counters:
- total number of received mutations
- total number of forwarded mutations (attempts/errors)
In order to get a better picture of what is going on on a local Node
we are adding two counters on a database level:
- total number of writes
- total number of reads
Comparing these to total writes/reads in a storage_proxy may give a good
idea if there is an excessive access to a local DB for example."
Commit d3fe0c5 ("Refactor db/keyspace/column_family toplogy") changed
database::find_keyspace() to throw a std::nested_exception so the catch
block in migration_manager::announce_keyspace_drop() no longer catches
the exception. Fix the issue by explicitly checking if the keyspace
exists and throwing the correct exception type if it doesn't.
Fixes TestCQL.keyspace_test.
Message-Id: <1461218910-26691-1-git-send-email-penberg@scylladb.com>
This patch adds a counter of total writes and reads
for each shard.
It seems that nothing ensures that all database queries are
ready before database object is destroyed.
Make _stats lw_shared_ptr in order to ensure that the object is
alive when lambda gets to incrementing it.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Add split (local Nodes, external Nodes aggregated per Nodes' DCs) counters
for the following read categories:
- data reads
- digest reads
- mutation data reads
Each category is added attempts, completions and errors metrics.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Added split metrics for operations on a local Node and on external
Nodes aggregated per Nodes' DCs.
Added separate split counters for:
- total writes attempts/errors
- read repair write attempts (there is no easy way to separate errors
at the moment)
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
After commit a843aea547, a gate was introduced to make sure that
an asynchronous operation is finished before column family is
destroyed. A sstable testcase was not stopping column family,
instead it just removed column family from compaction manager.
That could cause an user-after-free if column family is destroyed
while the asynchronous operation is running. Let's fix it by
stopping column family in the test.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ed910ec459c1752148099e6dc503e7f3adee54da.1461177411.git.raphaelsc@scylladb.com>
This patch ensures type_parser can handle user defined types. It also
prefixes user_type_impl::make_name() with
org.apache.cassandra.db.marshal.UserType.
Fixes#631
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a function to abstract_type that locates the usage of
a given user_type and recursively returns an updated version of the
containing type containing the updated user type.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a virtual function to the abstract_type hierarchy to
tell whether a given type references the specified type. Needed to
implement the drop and alter type statements.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the merge_types() function,
allowing mutations to user defined types to be applied.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A new user type is checked for compatibility against the previous
version of that type, so as to ensure that an updated field type
is compatible with the previous field type (e.g., altering a field
type from text to blob is allowed, but not the other way around).
However, it is also possible to add new fields to a user type. So,
when comparing a user type against its previous version, we should
also allow the current, new type to be longer than the previous one.
The current code instead allows for the previous type to be longer,
which this patch fixes.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch allows abstract_types to be compared for equality. In
particular, it enables the indirect_equal_to<abstract_type> idiom.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch defines the member functions responsible for announce
create, update and drop user defined types migration.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This field is superfluous and adds confusion regarding the user_types
field in the keyspace metadata.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
All the callers of do_serialize_mutation_form pass a valid tombstone
that is converted into a non-empty optional. This happens even if the
tombstone is empty (tombstone::timestamp == api::missing_timestamp).
This patch fixes this by passing in a reference to the tombstone which
is convertible to bool, based on whether it is empty or not.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1460620528-3628-1-git-send-email-duarte@scylladb.com>
"Conversion/implementation of "authorizer" code from origin, handling
permissions management for users/resources.
Default implementation keeps mapping of <user.resource>->{permissions}
in a table, contents of which is cached for slightly quicker checks.
Adds access control to all (existing) cql statements.
Adds access management support to the CQL impl. (GRANT/REVOKE/LIST)
Verified manually and with dtest auth_test.py. Note that several of these
still fail due to (unrelated) unimplemented features, like index, types
etc.
Fixes#1138"
While scylla-ami-setup.service is running, login message says "run systemctl status scylla-server" to see status, but it actually never launched yet.
This patch fixes the message to notice RAID construction is running, and 'systemctl status scylla-ami-setup' is the correct way to see status.
Fixes#1035
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1460660628-10103-2-git-send-email-syuu@scylladb.com>
From Glauber:
There are current some outstanding issues with the throttling code. It's
easier to see them with the streaming code, but at least one of them is general.
One of them is related to situations in which the amount of memory available
leaves only one memtable fitting in memory. That would only happen with the
general code if we set the memtable cleanup threshold to 100 % - and I don't
even know if it is valid - but will happen quite often with the streaming code.
If that happens, we'll start throttling when that memtable is being written,
but won't be able to put anything else in its place - leading to unnecessary
throttling.
The second, and more serious, happens when we start throttling and the amount
of available memory is not at least 1MB. This can deadlock the database in
the sense that it will prevent any request from continuing, and in turn causing
a flush due to memtable size. It is a good practice anyway to always guarantee
progress.
Fixes#1144
Broken by f15c380a4f.
This resulted in empty collection being returned in the results
instead of no collection.
Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest
from cassandra-unit-tests.
* seastar 2aeb9dd...2185f37 (15):
> reactor: avoid issuing systemwide memory barriers in parallel
> Revert "Use sys_membarrier() when available"
> Merge "Various exception-safety fixes" from Tomasz
> future-util: make map reduce exception safe
> collectd: do not give up after a failure
> future-util: make repeat_until_value exception safe
> rpc: do not block connection when unknown verbs is received
> rpc: do not wait for a reply after timeout
> rpc: move connection stats to base class
> core/reactor: Handle io_submit failures inside flush_pending_aio
> apps/iotune: add --fs-check option to use iotune for kernel version check
> Merge "Some exception safety patches" from Paweł
> tls: Fix conversion of dh_params::level to gnutls_sec_param_t
> core: posix_thread: Mark start_routine as noexcept
> fair_queue: better overflow protection
"If we compact sstables A, B into a new sstable C we must either delete both
A and B, or none of them. This is because a tombstone in B may delete data
in A, and during compaction, both the tombstone and the data are removed.
If only B is deleted, then the data gets resurrected.
Non-atomic deletion occurs because the filesystem does not support atomic
deletion of multiple files; but the window for that is small and is not
addressed in this patchset. Another case is when A is shared across
multiple shards (as is the case when changing shard count, or migrating
from existing Cassandra sstables). This case is covered by this patchset.
Fixes #1181."
Our current throttling code releases one requests per 1MB of memory available
that we have. If we are below the memory limit, but not by 1MB or more, then
we will keep getting to unthrottle, but never really do anything.
If another memtable is close to the flushing point, those requests may be
exactly the ones that would make it flush. Without them, we'll freeze the
database.
In general, we need to always release at least one request to make sure that
progress is always achieved.
This fixes#1144
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Since commit c1cffd06 logger catch errors internally, so no need to
catch most of them at the top level. Only those that can happen during
parameter evaluation can reach here. Change parameters to not throw
too.
query::result transformation to printable form is very heavy operation
that allocates memory and thus can fail. Add a class to query::result that
can be used with logger to push to string conversion when output is
performed.
This is usually not a problem for the main memtable list - although it can be,
depending on settings, but shows up easily for the streaming memtables list.
We would like to have at least two memtables, even if we have to cut it short.
If we don't do that, one memtable will have use all available memory and we'll
force throttling until the memtable gets totally flushed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
throttle_state is currently a nested member of database, but there is no
particular reason - aside from the fact that it is currently only ever
referenced by the database for us to do so.
We'll soon want to have some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is a preparation patch so we can move the throttling infrastructure inside
the memtable_list. To do that, the region group will have to be passed to the
throttler so let's just go ahead and store it.
In consequence of that, all that the CF has to tell us is what is the current
schema - no longer how to create a new memtable.
Also, with a new parameter to be passed to the memtable_list the creation code
gets quite big and hard to follow. So let's move the creation functions to a
helper.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If sstables A, B are compacted, A and B must be deleted atomically.
Otherwise, if A has data that is covered by a tombstone in B, and that
tombstone is deleted, and if B is deleted while A is not, then the data
in A is resurrected.
Fixes#1181.
A shared sstable must be compacted by all shards before it can be deleted.
Since we're stoping, that's not going to happen. Cancel those pending
deletions to let anyone waiting on them to continue.
When we compact a set of sstables, we have to remove the set atomically,
otherwise we can resurrect data if the following happens:
insert data to sstable A
insert tombstone to sstable B
compact A+B -> C (removing both data and tombstone)
delete B only
read data from A
Since an sstable may be shared by multiple shard, and each shard performs
compaction at a different time, we need to defer deletion of an sstable
set until all shards agree that the set can be deleted.
An additional atomicity issue exists because posix does not provide a way
to atomically delete multiple files. This issue is not addressed by this
patch.
"With this series, we support all the 3 nodetool removenode commands, e.g.,
$ nodetool removenode 778948bf-6709-4eb5-80fe-bee911e9c3bf
$ nodetool removenode status
RemovalStatus: Removing token (-8969872965815280276). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
$ nodetool removenode force
RemovalStatus: No token removals in process.
Tested with:
1)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
2)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
- kill -9 node3
- nodetool removenode will wait forever since node3 is gonne, node3
will never send the replication confirmation to node1
- run nodetool removenode force on node1
nodetool removenode completes with the following error:
$ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e
nodetool: Scylla API server HTTP POST to URL
'/storage_service/remove_node' failed: nodetool removenode force is called by user
nodetool removenode force completes sucessfully
$ nodetool removenode force
RemovalStatus: Removing token (-9171569494049085776). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
Fixes #1135."
Make the Docker image more user-friendly by starting up JMX proxy in the
background and install Scylla tools in the image. Also add a welcome
banner like we have with our AMI so that users have pointers to nodetool
and cqlsh, as well as our documentation.
Message-Id: <1460376059-3678-1-git-send-email-penberg@scylladb.com>
This is kind of sorting, so it belongs there, but it also fixes a bug in
storage_proxy::get_read_executor() that assumes filter_for_query() do
not change order of nodes in all_nodes when extra replica is chosen.
Otherwise if coordinator ip happens to be last in all_nodes then it will
be chosen as extra replica and will be quired twice.
Message-Id: <1460549369-29523-1-git-send-email-gleb@scylladb.com>
A new user type is checked for compatibility against the previous
version of that type, so as to ensure that an updated field type
is compatible with the previous field type (e.g., altering a field
type from text to blob is allowed, but not the other way around).
However, it is also possible to add new fields to a user type. So,
when comparing a user type against its previous version, we should
also allow the current, new type to be longer than the previous one.
The current code instead allows for the previous type to be longer,
which this patch fixes.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1460627939-11376-12-git-send-email-duarte@scylladb.com>
We choosed #!/bin/sh for shebang when we started to implement installer scripts, not bash.
After we started to work on Ubuntu, we found that we mistakenly used bash syntax on AMI script, it caused error since /bin/sh is dash on Ubuntu.
So we changed shebang to /bin/bash for the script, from that time we have both sh scripts and bash scripts.
(2f39e2e269)
If we use bash syntax on sh scripts, it won't work on Ubuntu but works on Fedora/CentOS, could be very easy to confusing.
So switch all scripts to #!/bin/bash. It will much safer.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1460594643-30666-1-git-send-email-syuu@scylladb.com>
"This patchset contains some fixes spotted during post-merged review
by {Nad,}av{,i}. I don't consider any of them a must for backport to 1.0,
but since we haven't yet even backported the main series, might as well backport
everything.
It also includes some unit tests to make sure that they will be kept working
in the future."
There is a problem in the implementation of leveled compaction strategy that
prevents level 1 from being compacted into level 2, and so forth. As a result,
all sstables will only belong to either level 0 or 1. One of the consequences
is level 1 being overwhelmed by a huge amount of sstables.
The root of the problem is a conditional statement in the code that prevents a
single sstable, with level > 0, from being compacted into a subsequent level
that is empty or has no overlapping sstables.
Fixes#1180.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <9a4bffdb0368dea77b49c23687015ff5832299ab.1460508373.git.raphaelsc@scylladb.com>
They are used by nodetool removenode:
$ nodetool removenode force
$ nodetool removenode status
For example:
$ nodetool removenode status
RemovalStatus: Removing token (-8969872965815280276). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
$ nodetool removenode force
RemovalStatus: No token removals in process.
Tested with:
1)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
2)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
- kill -9 node3
- nodetool removenode will wait forever since node3 is gonne, node3
will never send the replication confirmation to node1
- run nodetool removenode force on node1
nodetool removenode completes with the following error:
$ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e
nodetool: Scylla API server HTTP POST to URL
'/storage_service/remove_node' failed: nodetool removenode force is called by user
nodetool removenode force completes sucessfully
$ nodetool removenode force
RemovalStatus: Removing token (-9171569494049085776). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
Fixes 1135.
This change will allow user to specify the maximum size of a new sstable
created as a result of leveled compaction.
Example of using this setting:
ALTER TABLE ks.test5 with compaction = {'sstable_size_in_mb': '1000',
'class': 'LeveledCompactionStrategy'}
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ebb9844401af74388bda12586c2435283f6d8db8.1460486043.git.raphaelsc@scylladb.com>
When we recreate the summary from a missing Summary, we should make
sure it is generated sanely, and that it resembles the Summary that
would have otherwise been there.
In this tests we'll grab one of the Summary tests we've been doing,
and just apply them to the non-existent Summary file. We expect
the same results on those cases. Plus, a new test is added with some
sanity checking.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Now that we can boot without a Summary file, we can just as easily boot
with a broken one.
Suggested by Nadav, and it is actually very easy to do, so do it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Spotted by Avi post-merge
1) Need to close the file
2) Should be using the parameter pc instead of the default_class
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This shouldn't be a problem in practice, because if read_toc() fails,
the users will just tend to discard the sstable object altogether, and
not insist on using it.
However, if somebody does try to keep using it, a subsequent read_toc() could
theoretically have some components filled up leading the new reader to believe
the toc was populated successfully.
It is easier to just clear the _components set and never worry about it, than
trying to reason about whether or not that could happen.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
To share systemd unit file between Fedora/CentOS and Ubuntu, generate
systemd unit file on building time since Fedora/CentOS and Ubuntu has
sysconfdir on different place.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459779957-11007-1-git-send-email-syuu@scylladb.com>
Using leveled compaction strategy, only a few sstables will contain a
given key, so we need to filter out the rest. Using the summary entries
to filter keys works if the key is before the first summary entry,
but does not work if it is after the last summary entry, because the last
summary entry does not represent the last key; so sstables that are
are towards the beginning of the ring are read even if they do not contain
the key, greatly reducing read performance.
Fix by consulting the summary's first_key/last_key entries before consulting
the summary entry array.
Hints are currently unimplemented but there is code depending on the
fact that hint_to_dead_endpoints() doesn't throw.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Some of the exceptions are not thrown but constructed and set to some
future. In such case if there is another exception thrown in the
constructor it won't be propagated properly as it will casue stack to be
unwind in the place where the future is set, not in the continuation
chain waiting for it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If the other end of the connection has already disconnected the shutdown
will fail with ENOTCONN. The resulting exception is going to propagate
through the continuation chain that is supposed to shut the cql server
down preventing it from properly waiting for all outstanding
continuations.
The solution is to just ignore any errors that shutdown() may return.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
seastar::async() creates a seastar thread and to do that allocates
memory. That allocation, obviously, may fail so the error handling code
needs to be moved so that it also catches errors from thread creation.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Most of them are missing std::bad_alloc (which leads to aborts) and they
force the compiler to add unnecessary runtime checks.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
live_row_count is summed several times in the same function. Do it only
once.
--
v1->v2:
- call get() on std::reference_wrapper<std::vector<partition>> to get
to reference for moving out of it.
Message-Id: <20160411123829.GE21479@scylladb.com>
"Adds support for CQL commands to create, alter, drop and list users.
Verified manually and by relevant dtests.
With this patch set, scylla supports adding super/regular users and
run sessions logged in as these. Note however that since actual
authorization is still not implemented, no CF/KS is really protected
by the authentication beyond initial login.
Some fixes for lingering bugs in user management in the existing code
as well.
Fixes#1121"
The default recognition error messages in antlr C++ backend are
different from Java backend which makes Scylla's CQL error messages
incompatible with Cassandra. This makes it very hard to write CQL level
test cases which are portable between Scylla and Cassandra.
To fix the issue, override the most common lexer and parser error
messages to follow the convention set by the antlr Java backend. This
unlocks various test cases in AlterTest, for example.
Message-Id: <1460032883-14422-1-git-send-email-penberg@scylladb.com>
transport::server uses client_state in a move-temporary-around
fashion. Having a setter that does continuation-bound validation
makes this messier. Break them up to separate "this" placement
from the actual validation continuation logic
This is even a more elaborate tombstone merging unit test, with
3 levels of nesting, which did not pass with older range-tombstone
merging algorithms, and works with the current one.
I started with deletion of three nested levels of row -
aaa, aaa:bbb, and aaa:bbb::ccc. I then complicated the sstable
even further by adding additional middle-points with the same
timestamps (which we saw happening in some real-life sstables),
resulting in:
[
{"key": "pk",
"cells": [["aaa:_","aaa:bba:_",1459438519943668,"t",1459438519],
["aaa:bba:_","aaa:bbb:_",1459438519943668,"t",1459438519],
["aaa:bbb:_","aaa:bbb:ccb:_",1459438519950348,"t",1459438519],
["aaa:bbb:ccb:_","aaa:bbb:ccc:_",1459438519950348,"t",1459438519],
["aaa:bbb:ccc:_","aaa:bbb:ccc:!",1459438519958850,"t",1459438519],
["aaa:bbb:ccc:!","aaa:bbb:ddd:!",1459438519950348,"t",1459438519],
["aaa:bbb:ddd:!","aaa:bbb:!",1459438519950348,"t",1459438519],
["aaa:bbb:!","aaa:!",1459438519943668,"t",1459438519]]}
]
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459778074-10759-3-git-send-email-nyh@scylladb.com>
In the tombstone_merging test, we expected one row tombstone. But we did
not verify that in addition to that row tombstone, there is no other rows
(deleted or otherwise). It turns out that in the onld merging algorithm,
we did produce additional deleted rows which shouldn't have been there.
So this patch adds a test that there are no such additional deleted rows
beyond the one row tombstone we expect. The test passes with the new
range tombstone merging algorithm.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459778074-10759-2-git-send-email-nyh@scylladb.com>
In commit 99ecda3c96, we overhauled the
way we read Cassandra's disjoint range tombstones, and convert them to
the overlapping whole-prefix tombstones which we support.
Unfortunately, while this algorithm worked correctly for a couple of
test cases, it did not for additional test cases. While the previous
algorithm could not generate "wrong" tombstones (it didn't generate things
it didn't see), it could generate redundant overlapping tombstones, and
missed some sanity checks about the correctness of the merge process.
In this patch, a new algorithm makes sure to not generate redundant
tombstones, and includes additional tests to ensure that we do not
mistakenly merge range tombstones which cannot actually be merged.
The following patches will include tests which failed with the previous
algorithm, and succeeds with this one.
I described the new algorithm on the ScyllaDB mailing list this way:
1. Have a stack of open ranges, start & timestamp for each (no end for
each), and just one "end of last contiguous deletion"
Processing each range tombstone:
2. If the start of a range tombstone is not adjacent to the "end of last
deletion", assert we have no open range on the stack (because we can
never close those). In any case, set the "end of of last deletion" to
the end of this tombstone.
3. If the current tombstone's timestamp is STRICTLY HIGHER than that on the
top of the stack, push the new tombstone's start+timestamp to the stack.
Note: If it was STRICTLY LOWER, throw error (it means the open range will
never be closed).
4. If the current tombstone's end matches (i.e., closes row) of the start on
the top of the stack, emit this tombstone and pop the stack.
When the row ends:
5. Assert the stack is empty.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459778074-10759-1-git-send-email-nyh@scylladb.com>
* seastar aa281bd...2aeb9dd (20):
> memory: avoid exercising the reclaimers for oversized requests
> tests: test cross-cpu free not underflowing live object counter
> memory: fix live objects counter underflow due to cross-cpu free
> core/reactor: Don't abort in allocate_aligned_buffer() on allocation failure
> build: add --tests-debuginfo, to avoid stripping tests
> connected_socket: Add buffer size arg to output()
> scripts/posix_net_conf.sh: added a support for bonding interfaces
> scripts/posix_net_conf.sh: move the NIC configuration code into a separate function
> scripts/posix_net_conf.sh: implement the logic for selecting default MQ mode
> scripts/posix_net_conf.sh: forward the interface name as a parameter
> http/routes: Remove request failure logging to stderr
> lowres_clock: Initialize _now when the clock is created
> apps/iotune: fix broken URL
> tutorial: expand and improve semaphore section
> DPDK: support set RSS key to port_conf when hash_key_size is unknown
> dpdk: aware of vmxnet3 max xmit frags and do linearizing
> packet_util: insert out of order packet when map is empty
> core: Fix use-after-free of scollectd::impl
> futures: Optimize finally()
> futures: Factor out exceptional path of finally()
"Summary files are a relatively recent addition to Cassandra. I thought
that every SSTable converted to 2.1 would have them, but that does not
seem to be true. It's easy to generate a stream of files that will boot
in Cassandra 2.1 just fine, but not in Scylla as they will be missing
the Summary.
Cassandra can boot those files because they are robust against the Summary
not existing, and we should do the same.
Since we keep the Summary in memory, in case one does not exist we create a
memory copy of it from the Index - the filesystem is not touched. Hopefully,
compaction will run soon and the next time we boot we won't have to do such
thing.
Fixes#1170"
There are cases in which a Summary file will not be present, and imported
SSTables will have just the Index and Data files. In earlier versions of
Cassandra, a Summary didn't exist, so one may not be generated when migrating.
In Issue #1170, we can see an example of tables generated by CQLSSTableWriter,
and they lack a Summary. Cassandra is robust against this and can cope
perfectly with the Summary not existing. I will argue that we should do the
same.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We do that by bailing immediately if we detect that the components
map is already populated. This allow us to call read_toc() earlier
if we need to - for instance, to inquire about the existence of the
Summary - without the need to re-read the components again later.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
for prepare_summary we can just pass the min interval as a parameter and
avoid having the schema do yet another hop. For sealing the summary, it
is completely unused and we can do away with it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is done so we can use other consumers. An example of that, is regeneration
of the Summary from an existing Index.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Because just creating an SSTable object does not generate any I/O,
get_sstable_key_range should be an instance method. The main advantage
of doing that is that we won't have to read the summary twice. The way
we're doing it currently, if happens to be a shard-relevant table we'll
call load() - which reads the summary again.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we read the Summary file twice. That actually happens
every time during normal boot (it doesn't during refresh). First during
get_sstable_key_range and then again during load().
Every summary will have at least one entry, so we can easily test for whether
or not this is properly initialized.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test.
Broken by f15c380a4f, where the
calcualtion of has_ck_selector got broken, in such a way that present
clustering restrictions were treated as if not present, which resulted
in static row being returned when it shouldn't.
While at it, unify the check between query_compacted() and
do_compact() by extracting it to a function.
The first erase_and_dispose(), which removes rows between last
position and beginning of the next range, can invalidate end()
iterator of the range. Fix by looking up end after erasing.
mutation_partition::range() was split into lower_bound() and
upper_bound() to allow for that.
This affects for example queries with descending order where the
selected clustering range is empty and falls before all rows.
Exposed by f15c380a4f, which is now
calling do_compact() during query.
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test
"Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.
The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.
perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.
Fixes #1165."
"There is a need to have an ability to detect whether a feature is
supported by entire cluster. The way to do it is to advertise feature
availability over gossip and then each node will be able to check if all
other nodes have a feature in question.
The idea is to have new application state SUPPORTED_FEATURES that will contain
set of strings, each string holding feature name.
This series adds API to do so.
The following patch on top of this series demostreates how to wait for features
during boot up. FEATURE1 and FEATURE2 are introduced. We use
wait_for_feature_on_all_node to wait for FEATURE1 and FEATURE2 successfully.
Since FEATURE3 is not supported, the wait will not succeed, the wait will timeout.
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -95,7 +95,7 @@ sstring storage_service::get_config_supported_features() {
// Add features supported by this local node. When a new feature is
// introduced in scylla, update it here, e.g.,
// return sstring("FEATURE1,FEATURE2")
- return sstring("");
+ return sstring("FEATURE1,FEATURE2");
}
std::set<inet_address> get_seeds() {
@@ -212,6 +212,11 @@ void storage_service::prepare_to_join() {
// gossip snitch infos (local DC and rack)
gossip_snitch_info().get();
+ gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE1"), sstring("FEATURE2")}, std::chrono::seconds(30)).get();
+ logger.info("Wait for FEATURE1 and FEATURE2 done");
+ gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE3")}).get();
+ logger.info("Wait for FEATURE3 done");
+
We can query the supported_features:
cqlsh> SELECT supported_features from system.peers;
supported_features
--------------------
FEATURE1,FEATURE2
FEATURE1,FEATURE2
(2 rows)
cqlsh> SELECT supported_features from system.local;
supported_features
--------------------
FEATURE1,FEATURE2
(1 rows)"
Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.
The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.
perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.
Fixes#1165.
Advertise features supported by this node, so that other nodes can know
this info. For example, on a 3 node cluster with supported_features ==
FEATURE1 and FEATURE2, it looks like:
cqlsh> SELECT supported_features from system.peers;
supported_features
--------------------
FEATURE1,FEATURE2
FEATURE1,FEATURE2
(2 rows)
cqlsh> SELECT supported_features from system.local;
supported_features
--------------------
FEATURE1,FEATURE2
(1 rows)
It tells features supported by this local node. When new feature is
introduced in scylla, update features returned by
get_config_supported_features, e.g.,
return sstring("FEATURE1,FEATURE2")
API to wait for features are available on a node or all the nodes in the
cluster.
$timeout specifies how long we want to wait. If the features are not
availabe yet, sleep 2 seconds and retry.
- Get features supported by this particular node
std::set<sstring> get_supported_features(inet_address endpoint) const;
- Get features supported by all the nodes this node knows about
std::set<sstring> get_supported_features() const;
After this change, user can query compression ratio on a per column
family basis with 'nodetool cfstats'.
look at 'nodetool cfstats' output:
./bin/nodetool cfstats ks.test5
Keyspace: ks
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: test5
SSTable count: 1
Space used (live): 4774
Space used (total): 4774
Space used by snapshots (total): 0
Off heap memory used (total): 131384
SSTable Compression Ratio: 0.833333
...
Fixes#636.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <a1bee5a23fe63787df3e387a88f2d216ba4a4134.1459802771.git.raphaelsc@scylladb.com>
"batchlog_manager is modified to allow the storage_service to initate a bachlog
replay operation.
Refs #1085.
Tested with tests/batchlog_manager_test and batch_test.py"
With big rows I see contention in XFS allocations which cause reactor
thread to sleep. Commitlog is a main offender, so enlarge extent to
commitlog segment size for big files (commitlog and sstable Data files).
Message-Id: <20160404110952.GP20957@scylladb.com>
This is a left over from the re ordering of the API init. The api_doc
should be set first, so later API registration will enable their
relevent swagger doc.
Currently, the swagger documentation of the system API is not available.
Fixes#1160
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1459750490-15996-1-git-send-email-amnon@scylladb.com>
Both build_rpm.sh and build_deb.sh will fail with "cannot stat 'xxx': No such file or directory" when scylla-server package is not installed, need to prevent it by --no-dereference option of cp.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459523585-9108-1-git-send-email-syuu@scylladb.com>
This is another unit test for range tombstone merging, introduced in commit
0fc9a5ee4d and rewritten in commit
99ecda3c96.
In this test, a single large deletion was broken up into several smaller
ranges, all with the same time stamps, so we should recombine them into
one row tombstone, instead of failing the read.
The sstable in this test case was artificially created using json2sstable.
We don't know how yet to produce such a case using Cassandra 2, but we
have seen a similar occurance in the wild, in a real SSTable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459429243-15821-1-git-send-email-nyh@scylladb.com>
Logging is used in many places including those that shouldn't really
throw any exceptions (destructors, noexcept functions).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
abstract_read_executor::reconcile() is supposed to make sure that
_result_promise is eventually set to either a result or an exception.
That may not happen however if reconciliation throws any exception
since only read timeouts are being caught. When that happends the
continuation chain becomes stuck.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
HyperLogLog constructor promises that it only throws instances of
std::invalid_argument. That's a lie since it also adds elements to a
vector (and doesn't catch potential bad_allocs).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Cassandra-derived tools (such as sstable2json) may write commitlog segments,
that Scylla cannot recognize. Since we now write them with a distinct name,
we can recognize the name and ignore these segments, as we know the data they
contain is not interesting.
Fixes#1112.
Message-Id: <1459356904-20699-1-git-send-email-avi@scylladb.com>
The check_marker() function is use as a sanity-check of data we read
from sstable, so instead of the header file key.hh, let's move it to
the sstable-parsing source file partition.cc.
In addition to having less code in header files, another benefit is
that the function can now throw a more specific exception (malformed
sstable exception).
Also fixed the exception's message (which had a second "%d" but only
one parameter).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459420430-5968-1-git-send-email-nyh@scylladb.com>
Until recently, we believed that range tombstones we read from sstables will
always be for entire rows (or more generalized clustering-key prefixes),
not for arbitrary ranges. But as we found out, because Cassandra insists
that range tombstones do not overlap, it may take two overlapping row
tombstones and convert them into three range tombstones which look like
general ranges (see the patch for a more detailed example).
Not only do we need to accept such "split" range tombstones, we also need
to convert them back to our internal representation which, in the above
example, involves two overlapping tombstones. This is what this patch does.
This patch also contains a test for this case: We created in Cassandra
an sstable with two overlapping deletions, and verify that when we read
it to Scylla, we get these two overlapping deletions - despite the
sstable file actually having contained three non-overlapping tombstones.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <b7c07466074bf0db6457323af8622bb5210bb86a.1459399004.git.glauber@scylladb.com>
The default shell in Ubuntu is "dash" which causes the following error
when "scylla-start" script is executed:
/start-scylla: 8: /start-scylla: source: not found
Message-Id: <1459406561-20141-1-git-send-email-penberg@scylladb.com>
This patch overrides the antlr3 function that allocates the missing
tokens that would eventually leak. The override stores these tokens in
a vector, ensuring memory is freed whenever the parser is destroyed.
Fixes#1147
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1459355146-17402-1-git-send-email-duarte@scylladb.com>
antlr3 leaks the token itself creates when recovering from a mismatch in
the case the missing token can be determined. Until this bug is fixed
or circumvented, the test should remain disabled.
Ref #1147
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1459345403-8243-1-git-send-email-duarte@scylladb.com>
During decommission, the storage_service::unbootstrap() needs to
initiate a batchlog replay operation. To sync the replay operation
initiated by the timer in batchlog_manager and storage_service, a
semaphore is introduced. To simplify the semaphore locking, the
management code now always runs on shard zero, but the real work is
distruted to all shards.
This is a rewrite of Glauber's earlier patch to do the same thing, taking
into account Avi's comments (do not use a class, do not throw from the
constructor, etc.). I also verified that the actual use case which was
broken in #1136 was fixed by this patch.
Currently, we have no support for range tombstones because CQL will not
generate them as of version 2.x. Thrift will, but we can safely leave this for
the future.
However, we have seen cases during a real migration in which a pure-CQL
Cassandra would generate range tombstones in its SSTables.
Although we are not sure how and why, those range tombstones were of a special
kind: their end and next's start range were adjacent, which means that in
reality, they could very well have been written as a single range tombstone for
an entire clustering key - which we support just fine.
This code will attempt to fix this problem temporarily by merging such ranges
if possible. Care must be taken so that we don't end up accepting a true
generic range tombstone by accident.
Fixes#1136
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459333972-20345-1-git-send-email-nyh@scylladb.com>
As Nadav noticed in his bug report, check_marker is creating its error messages
using characters instead of numbers - which is what we intended here in the
first place.
That happens because sprint(), when faced with an 8-byte type, interprets this
as a character. To avoid that we'll use uint16_t types, taking care not to
sign-extend them.
The bug also noted that one of the error messages is missing a parameter, and
that is also fixed.
Fixes#1122
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>
When we, for some reason, fail to compact an SSTable, we do not log the file
name leaving us with cryptic messages that tell us what happened, but not where
it happened.
This patch adds logging in compaction so that we'll know what's going on.
Please note that readers are more of a concern, because the SSTable being
written technically do not exist yet. Still, better safe than sorry: if
open_data fails, or we leave an unfinished SSTable, it is still good to know
which one was the culprit.
Some argument can be made about whether we should log this at the lower SSTable
level, or at the compaction level.
The reason I am logging this at the compaction level, is that we don't really
know which exception will trigger, and where: it may be the case that we're
seeing exceptions that are not SSTable specific, and may not have the chance to
log it properly.
In particular, if the exception happens inside the reader: read_rows() and
friends only return a mutation reader, which doesn't really do anything until
we call read(). But at that time, we don't hold any pointers to the SSTable
anymore.
In Summary, logging at the compaction level guarantees that we always do it no
matter what. Exceptions that are part of the main SSTable path can log the file
name as well if they want: if that's the case, we'll be left with the name
appearing twice. That's totally harmless, and better than none.
Fixes#1123
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5c969fb6aeb788a037bd7a4ea69979c1042cb34.1459263847.git.glauber@scylladb.com>
commitlog's sync period is initialized as the batch period, and not as the
sync period itself as it should be.
I've found this by code inspection, but unless I am missing something
really fundamental, this seems to be completely wrong. It's been working
fine because in our defaults, I have checked that both variables default to
the same value. But it seems to me that as long as anyone would change one
of them, the behavior wouldn't be as expected.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2e7c565242fe5d4481a3ee8b0ba425ef14f5e42a.1459252783.git.glauber@scylladb.com>
Sysv init script was added just for prevent warning message on lintian,
never really used by Ubuntu users. Result of that, we often break this
script since upstart/systemd unit file frequently changed. It may
confuse users, it's better to use Upstart only, just like Fedora/CentOS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459177601-20269-2-git-send-email-syuu@scylladb.com>
On some Fedora environments such as Fedora official AMI, dnf-yum package
is not installed by default, causes command not found error when we run
our setup scripts.
To prevent this, we need to add dnf-yum to scylla-server package
dependency.
Fixes#1106
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459099744-23068-1-git-send-email-syuu@scylladb.com>
After 4e52b41a4, remove_by_toc_name() became aware of temporary TOC
files, however, it doesn't consider that some components may be
missing if temporary TOC is present.
When creating a new sstable, the first thing we do is to write all
components into temporary TOC, so content of a temporary TOC isn't
reliable until it is renamed.
Solution is about implementing the following flow (described by Avi):
"Flow should be:
- remove all components in parallel
- forgive ENOENT, since the compoent may not have been written;
otherwise deletion error should be raised
- fsync the directory
- delete the temporary TOC
"
This problem can be reproduced by running compaction without disk
space, so compaction would fail and leave a partial sstable that would
be marked for deletion. Afterwards, remove_by_toc_name() would try to
delete a component that doesn't exist because it looked at the content
of temporary TOC.
Fixes#1095.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>
We had a problem reading certain existing Cassandra sstables into
Scylla.
Our consume_range_tombstone() function assumes that the start and end
columns have a certain "end of component" markers, and want to verify
that assumption. But because of bugs in older versions of Cassandra,
see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the
"end of component" was missing (set to 0). CASSANDRA-7593 suggested
this problem might exist on the start column, so we allowed for that,
but now we discovered a case where also the end column is set to 0 -
causing the test in consume_range_tombstone() to fail and the sstable
read to fail - causing Scylla to no be able to import that sstable from
Cassandra. Allowing for an 0 also on the end column made it possible
to read that sstable, compact it, and so on.
Fixes#1125.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>
Seastar wrongly limits the number of concurrent submit_to()s to a single
remote shard. This can cause an ABBA deadlock:
fiberA fiberB (x127)
submit_to(0) # lock schema
<- returns
submit_to(0) # lock schema (waits)
submit_to(0) # do work (waits)
The fiberBs wait for fiberA, which in turn waits for a fiberB to return.
While the correct fix is to remote the client-side limit and replace it
with a server-side per-verb limit, we start with a simpler fix that
replaces the blocking lock call with a non-blocking call, removing the
deadlock.
Fixes#1088.
Message-Id: <1459095357-28950-1-git-send-email-avi@scylladb.com>
Problem found by dtest which loads sstables with generation 1 and 2 into an
empty column family. The root of the problem is that reshuffle procedure
changes new sstables to start from generation 2 at least. So reshuffle could
try to set generation 1 to 2 when generation 2 exists.
This problem can be fixed by starting from generation 1 instead, so reshuffle
would handle this case properly.
Fixes#1099.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <88c51fbda9557a506ad99395aeb0a91cd550ede4.1458917237.git.raphaelsc@scylladb.com>
While Seastar in general can accept any parameter for its I/O queues, Scylla
in particular shouldn't run with them disabled. Such will be the status when
the max-io-requests parameter is not enabled.
On top of that, we would like to have enough depth per I/O queue not to allow
for shard-local parallelism. Therefore, we will require a minimum per-queue
capacity of 4. In machines where the disk iodepth is not enough to allow for 4
concurrent requests per shard, one should reduce the number of I/O queues.
For --max-io-requests, we will check the parameter itself. However, the
--num-io-queues parameter is not mandatory, and given enough concurrent
requests, Seastar's default configuration can very well just be doing the right
thing. So for that, we will check the final result of each I/O queue.
As it is the case with other checks of the sorts, this can be overridden by
the --developer-mode switch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>
Fixes#797
To make sure an inopportune crash after truncate does not leave
sstables on disk to be considered live, and thus resurrect data,
after a truncate, use delete function that renames the TOC file to
make sure we've marked sstables as dead on disk when we finish
this discard call.
Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>
Note: "normal" remove_by_toc_name must now be prepared for and check
if the TOC of the sstable is already moved to temp file when we
get to the juicy delete parts.
Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>
- Do nothing in case the session is closed, to prevent we fire up the
timer again
- Print log info when no progress has been made if the time expires, it
is very useful to debug a idle session
- Grab a reference when the keep alive timer is running
Message-Id: <9f2cc3164696905a6a39c0d072a980765d598dfd.1458782956.git.asias@scylladb.com>
"The following patches are reverted becasue they were thought they break
Glauber's "Make sure repairs do not cripple incoming load" series. It turns out
these two patches just made another bug more visisble. The bug is fixed in
c2eff7e824 (streaming: Complete receive task
after the flush). We can bring the two patches back now.
Passed repair_additional_test.py and update_cluster_layout_tests.py with smp 2."
Currently start() is not prepared to handle exceptions thrown from
service initialization. It's easy to trigger such exceprion by
starting two tests at the same time, which will result in socket bind
error.
Exception thrown from start() typically results in assertion failures
like this one:
seastar::sharded<Service>::~sharded() [with Service = database]: Assertion `_instances.empty()' failed.
This patch fixes the problem by combining start() and stop() in a
single do_with() and using RAII for stopping services.
Now exceptions thrown from service initialization should stop services
in proper order and let the original exception to pass
through. Example result:
fatal error in "test_new_schema_with_no_structural_change_is_propagated": std::runtime_error: bind: Address already in use
Message-Id: <1458768018-27662-1-git-send-email-tgrabiec@scylladb.com>
scylla_io_seup requires the scylla-server env to be setup to run
correctly. previously scylla_io_setup was encapsulated in
scylla-io.service that assured this.
extracting CPUSET,SMP from SCYLLA_ARGS as CPUSET is needed for invoking
io_tune
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <d49af9cb54ae327c38e451ff76fe0322e64a5f00.1458747527.git.shlomi@scylladb.com>
"This series makes sure that the influence of repairs on the ongoing loads is
limited. This patch does not fix the situation completely, but it will be the
best we can do for 1.0
Here's a brief explanation about some potentially contentions points, and future work:
1) With the old parallelism semaphore in tree, we could never really drop parallelism
below 256, since even with (local) parallelism = 1, we would still have 256 vnodes. So while
the number 100 is totally empirical, we know for a fact that around 200-something, we
start having real trouble. (total) parallelism = 100 is enough to allow us to survive
a load as much as 3 times heavier than the load described in Issue944. So while it is
empirical, at least it is based on something
2) I totally support changing the checksumming algorithm. However, I would rather focus
my efforts on testing this to exhaustion than doing this at the moment. But if anybody
wants to do it, I think it is a great thing to have before 1.0. Specially because
we'll probably need a new verb for that, so we would be better off having it from the start
3) This problem was made harder due to the fact that there are three conditions really that
can affect the ongoing load. Only one of them needs to trigger for us to see degradation, so
fixing them individually will usually buy us nothing.
Those are:
a) The disk bandwidth. Since the mutations are all together in the same memtable/commitlog
as normal memtables, we can differentiate between them from the I/O Scheduler perspective.
This is not an issue of course if the incoming mutations are not enough for us to saturate
the disk, but specially given the highly parallel nature of repair, we usually will. If
the commitlog queue starts getting too big, for instance, new requests will start being
put to wait. The effect of this part of the series is to *completely* shift the high waiting
times from those classes to the streaming ones (unfortunately compaction is still affected,
but that's fine IMHO). With the new streaming classes, the waiting time of a memtable / commitlog
requests is still kept in the microseconds range. The streaming classes, on the other hand,
will be in the hundreds of milliseconds range, or even seconds.
b) The memory consumption: since the whole problem that leads to a) is the fact that due to high
disk activity some requests will have to wait, we will end up with a lot of streaming memtables
not yet flushed. Because of that, we will start throttling new incoming CQL requests and all
the isolation efforts are rendered useless. Once again, due to the highly parallel nature of
repair, this turned out to be a very easy condition to trigger. The solution proposed here
is to limit a maximum amount of dirty memory for the repair job (in here, 25 %). This way,
we can endure even slightly heavier loads without sweating too much.
c) The task scheduler: repair generates a ton of requests for range checksums, and we actually
want to keep it that way - so that the ranges checksummed are small enough so we don't have
to resend a lot of mutations for no reason. However, if we pile up thousands of continuations
in the task scheduler, seastar has absolutely no mechanism (right now) to prioritize between
different kinds of requests. That means that the continuations that are supposed to be handling
user requests will simply not for a long time. Even if the Seastar load is less than 100 %
that is still a problem, since that is just adding hundreds of milliseconds worth of latencies to
any request processing.
Fixes#944 and fixes #1033."
A STREAM_MUTATION_DONE message will signal the receiver that the sender
has completed the sending of streams mutations. When the receiver finds
it has zero task to send and zero task to receive, it will finish the
stream_session, and in turn finish the stream_plan if all the
stream_sessions are finished. We should call receive_task_completed only
after the flush finishes so that when stream_plan is finshed all the
data is on disk.
Fixes repair_disjoint_data_test issue with Glauber's "[PATCH v4 0/9] Make
sure repairs do not cripple incoming load" serries
======================================================================
FAIL: repair_disjoint_data_test
(repair_additional_test.RepairAdditionalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "scylla-dtest/repair_additional_test.py",
line 102, in repair_disjoint_data_test
self.check_rows_on_node(node1, 3000)
File "scylla-dtest/repair_additional_test.py",
line 33, in check_rows_on_node
self.assertEqual(len(result), rows, len(result))
AssertionError: 2461
The repair code as it is right now is a bit convoluted: it resorts to detached
continuations + do_for_each when calling sync_ranges, and deals with the
problem of excessive parallelism by employing a semaphore inside that range.
Still, even by doing that, we still generate a great number of
checksum requests because the ranges themselves are processed in parallel.
It would be better to have a single-semaphore to limit the overall parallelism
for all requests.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Theoretically, because we can have a lot of pending streaming memtables, we can
have the database start throttling and incoming connections slowing down during
streaming.
Turns out this is actually a very easy condition to trigger. That is basically
because the other side of the wire in this case is quite efficient in sending
us work. This situation is alleviated a bit by reducing parallelism, but not
only it does't go away completely, once we have the tools to start increasing
parallelism again it will become common place.
The solution for this is to limit the streaming memtables to a fraction of the
total allowed dirty memory. Using the nesting capability built in in the LSA
regions, we will make the streaming region group a child of the main region
group. With that, we can throttle streaming requests separately, while at the
same time being able to control the total amount of dirty memory as well.
Because of the property, it can still be the case that incoming requests will
throttle earlier due to streaming - unless we allow for more dirty memory to be
used during repairs - but at least that effect will be limited.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The repair process will potentially send ranges containing few mutations,
definitely not enough to fill a memtable. It wants to know whether or not each
of those ranges individually succeeded or failed, so we need a future for each.
Small memtables being flushed are bad, and we would like to write bigger
memtables so we can better utilize our disks.
One of the ways to fix that, is changing the repair itself to send more
mutations at a single batch. But relying on that is a bad idea for two reasons:
First, the goals of the SSTable writer and the repair sender are at odds. The
SSTable writer wants to write as few SSTables as possible, while the repair
sender wants to break down the range in pieces as small as it can and checksum
them individually, so it doesn't have to send a lot of mutations for no reason.
Second, even if the repair process wants to process larger ranges at once, some
ranges themselves may be small. So while most ranges would be large, we would
still have potentially some fairly small SSTables lying around.
The best course of action in this case is to coalesce the incoming streams
write-side. repair can now choose whatever strategy - small or big ranges - it
wants, resting assure that the incoming memtables will be coalesced together.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Keeping the mutations coming from the streaming process as mutations like any
other have a number of advantages - and that's why we do it.
However, this makes it impossible for Seastar's I/O scheduler to differentiate
between incoming requests from clients, and those who are arriving from peers
in the streaming process.
As a result, if the streaming mutations consume a significant fraction of the
total mutations, and we happen to be using the disk at its limits, we are in no
position to provide any guarantees - defeating the whole purpose of the
scheduler.
To implement that, we'll keep a separate set of memtables that will contain
only streaming mutations. We don't have to do it this way, but doing so
makes life a lot easier. In particular, to write an SSTable, our API requires
(because the filter requires), that a good estimate on the number of partitions
is informed in advance. The partitions also need to be sorted.
We could write mutations directly to disk, but the above conditions couldn't be
met without significant effort. In particular, because mutations can be
arriving from multiple peer nodes, we can't really sort them without keeping a
staging area anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Streaming has currently one class, that can be used to contain the read
operations being generated by the streaming process. Those reads come from two
places:
- checksums (if doing repair)
- reading mutations to be sent over the wire.
Depending on the amount of data we're dealing with, that can generate a
significant chunk of data, with seconds worth of backlog, and if we need to
have the incoming writes intertwined with those reads, those can take a long
time.
Even if one node is only acting as a receiver, it may still read a lot for the
checksums - if we're talking about repairs, those are coming from the
checksums.
However, in more complicated failure scenarios, it is not hard to imagine a
node that will be both sending and receiving a lot of data.
The best way to guarantee progress on both fronts, is to put both kinds of
operations into different classes.
This patch introduces a new write class, and rename the old read class so it
can have a more meaningful name.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The column family still has to teach the memtable list how to allocate a new memtable,
since it uses CF parameters to do so.
After that, the memtable_list's constructor takes a seal and a create function and is complete.
The copy constructor can now go, since there are no users left.
The behavior of keeping a reference to the underlying memtables can also go, since we can now
guarantee that nobody is keeping references to it (it is not even a shared pointer anymore).
Individual memtables are, and users may be keeping references to them individually.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Each list can have a different active memtable. The column family method keeps
existing, since the two separate sets of memtable are just an implementation
detail to deal with the problem of streaming QoS: *the* active memtable keeps
being the one from the main list.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
memtable_list is currently just an alias for a vector of memtables. Let's move
them to a class on its own, exporting the relevant methods to keep user code
unchanged as much as possible.
This will help us keeping separate lists of memtables.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Since initialization now runs in a thread storage, messaging and
gossiper services initialization code may take advantage of it too.
Message-Id: <20160323094732.GF2282@scylladb.com>
Vlad and I were working on finding the root of the problems with
refresh. We found that refresh was deleting existing sstable files
because of a bug in a function that was supposed to return the maximum
generation of a column family.
The intention of this function is to get generation from last element
of column_family::_sstables, which is of type std::map.
However, we were incorrectly using std::map::end() to get last element,
so garbage was being read instead of maximum generation.
If the garbage value is lower than the minimum generation of a column
family, then reshuffle_sstables() would set generation of all existing
sstables to a lower value. That would confuse our mechanism used to
delete sstables because sstables loaded at boot stage were touched.
Solution to this problem is about using rbegin() instead of end() to
get last element from column_family::_sstables.
The other problem is that refresh will only load generations that are
larger than or equal to X, so new sstables with lower generation will
not be loaded. Solution is about creating a set with generation of
live SSTables from all shards, and using this set to determine whether
a generation is new or not.
The last change was about providing an unused generation to reshuffle
procedure by adding one to the maximum generation. That's important to
prevent reshuffle from touching an existing SSTable.
Tested 'refresh' under the following scenarios:
1) Existing generations: 1, 2, 3, 4. New ones: 5, 6.
2) Existing generations: 3, 4, 5, 6. New ones: 1, 2.
3) Existing generations: 1, 2, 3, 4. New ones: 7, 8.
4) No existing generation. No new generation.
5) No existing generation. New ones: 1, 2.
I also had to adapt existing testcase for reshuffle procedure.
Fixes#1073.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <1c7b8b7f94163d5cd00d90247598dd7d26442e70.1458694985.git.raphaelsc@scylladb.com>
Currently we execute all statements in parallel, but some statements
depend on order, in particular list append/prepend. Fix by executing
sequentially.
Fixes cql_additional_tests.py:TestCQL.batch_and_list_test dtest.
Fixes#1075.
Message-Id: <1458672874-4749-1-git-send-email-tgrabiec@scylladb.com>
"The test predates LSA zones and was not anticipating that LSA would
take much more free memory from the system than it needs in its assertions.
Fix by accounting for the fact properly."
"This implements #1065
- iotune will NOT be a part of scylla service - remove the scylla.io.service
- User will have to run it manually - using a script call scylla_io_tune_setup (that will do the exact same thing the service does today.
- if they wont, and do not use --developer-mode, scylla init will fail will a proper error - scylla will not start (in the same manner it does not start if you run scylla on non XFS FS)
- For c3,m3,i2 we will use the evaluation formula we have (that takes the number of disks , cores etc.)
- For other instances we will set --developer-mode. if the user logins into the instance - he will get a developer-mode warning
- No iotune on AWS"
Fixes#1065.
Fixes the following assertion failure:
row_cache_alloc_stress: tests/row_cache_alloc_stress.cc:120: main(int, char**)::<lambda()>::<lambda()>: Assertion `mt->occupancy().used_space() < memory::stats().free_memory()' failed.
memory::stats()::free_memory() may be much lower than the actual
amount of reclaimable memory in the system since LSA zones will try to
keep a lot of free segments to themselves. Fix by using actual amount
of reclaimable memory in the check.
Previosly if the user did not specify any metrics, scyllatop use
whatever it could find. Now we have some preset defaults which are
probably more interesting.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1458658804-377-1-git-send-email-yoav@scylladb.com>
* seastar c193821...9f2b868 (4):
> memory: set free memory to non-zero value in debug mode
> Merge "Increase IOTune's robustness by including a timeout" from Glauber
> shared_future: add companion class, shared_promise
> rpc: fix client connection stopping
Below are 3 possible cases in a stream session, after commit
208b7fa7ba (streaming: Simplify session completion logic) We might
close the session before the exchange of the PREPARE_DONE_MESSAGE
message in case 1). To fix, we defer the sending of mutations after
PREPARE_DONE_MESSAGE is sent at the initiator node.
1)
Initiator Follower
tx rx tx rx
1 0 0 1
send prepare
send back prepare
recev prepare
send mutations (close the session before prepare_done msg is sent)
recv mutations (close session before prepare_done msg is received)
send prepare_done
recv prepare_done and send no mutations
2)
Initiator Follower
tx rx tx rx
0 1 1 0
send prepare
send back prepare
recv prepare
nothing to send
send prepare_done
recv prepare_done and send mutations (close session)
recv mutations (close session)
3)
Initiator Follower
tx rx tx rx
1 1 1 1
send prepare
send back prepare
recv prepare
send mutations
recv mutations, can not close session since we have mutations to send
send prepare_done
recv prepare_done and send mutations (close session)
recv mutations (close session)
Message-Id: <d6510b558565db23202164fa491b883ef3796e58.1458634037.git.asias@scylladb.com>
Messaging service stop() method calls stop() on all clients. If
remove_rpc_client_one() is called while those stops are running
client::stop() will be called twice which not suppose to happen. Fix it
by ignoring client remove request during messaging service shutdown.
Fixes#1059
Message-Id: <1458639452-29388-2-git-send-email-gleb@scylladb.com>
Take a reference of messaging_service object inside
send_message_timeout_and_retry to make sure it is not freed during the
life time of send_message_timeout_and_retry operation.
The version number ordering rules are different for rpm and deb. Use
tilde ('~') for the latter to ensure a release candidate is ordered
_before_ a final version.
Message-Id: <1458627524-23030-1-git-send-email-penberg@scylladb.com>
"Adds an extension function SCYLLA_TIMEUUID_LIST_INDEX to CQL syntax
for collection element indexing, which, if the target is a list,
will attempt to directly index the list (which is really a map)
by the ordering time uuid (as index parameter)."
"We cannot leave partially applied mutation behind when the write
fails. It may fail if memory allocation fails in the middle of
apply(). This for example would violate write atomicity, readers
should either see the whole write or none at all.
This fix makes apply() revert partially applied data upon failure, by
the means of ReversiblyMergeable concept. In a nut shell the idea is
to store old state in the source mutation as we apply it and swap back
in case of exception. At cell level this swapping is inexpensive, just
rewiring pointers. For this to work, the source mutation needs to be
brought into mutable form, so frozen mutations need to be unfrozen. In
practice this doesn't increase amount of cell allocations in the
memtable apply path because incoming data will usually be newer and we
will have to copy it into LSA anyway. There are extra allocations
though for the data structures which holds cells.
I didn't see significant change in performance of:
build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13
The score fluctuates around ~77k ops/s.
The change was tested with a unit test (patch to mutation_test) which generates
random mutations and injects allocation failures at every possible allocation
site in the apply path. This also uncovered other preexisting bugs."
Commit 6a3872b355 fixed some use-after-free
bugs but introduced a new one because of a typo:
Instead of capturing a reference to the long-living io-class object, as
all the code does, one place in the code accidentally captured a *copy*
of this object. This copy had a very temporary life, and when a reference
to that *copy* was passed to sstable reading code which assumed that it
lives at least as long as the read call, a use-after-free resulted.
Fixes#1072
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458595629-9314-1-git-send-email-nyh@scylladb.com>
The test injects allocation failures at every allocation site during
apply(). Only allocations throug allocation_strategy are instrumented,
but currently those should include all allocations in the apply() path.
The target and source mutations are randomized.
We cannot leave partially applied mutation behind when the write
fails. It may fail if memory allocation fails in the middle of
apply(). This for example would violate write atomicity, readers
should either see the whole write or none at all.
This fix makes apply() revert partially applied data upon failure, by
the means of ReversiblyMergeable concept. In a nut shell the idea is
to store old state in the source mutation as we apply it and swap back
in case of exception. At cell level this swapping is inexpensive, just
rewiring pointers. For this to work, the source mutation needs to be
brought into mutable form, so frozen mutations need to be unfrozen. In
practice this doesn't increase amount of cell allocations in the
memtable apply path because incoming data will usually be newer and we
will have to copy it into LSA anyway. There are extra allocations
though for the data structures which holds cells.
I didn't see significant change in performance of:
build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13
The score fluctuates around ~77k ops/s.
Fixes#283.
Currently only "set" storage could store empty cells, but not the
"vector" one because there empty cell has the meaning of being
missing. To implement rolback, we need to be able to distinguish empty
cells from missing ones. Solve by making vector storage use a bitmap
for presence checking instead of emptiness. This adds 4 bytes to
vector storage.
It is needed for noexcept destruction, which we need for exception
safety in higher layers.
According to [1], erase() only throws if key comparison throws, and in
our case it doesn't.
[1] http://en.cppreference.com/w/cpp/container/unordered_map/erase
Both the initiator and follower of a stream session knows how many
transfer task and receive task the stream session contains in the
preparation phase. They use the _transfers and _receivers map to track
the tasks, like below:
std::map<UUID, stream_transfer_task> _transfers;
std::map<UUID, stream_receive_task> _receivers;
A stream_transfer_task will send STREAM_MUTATION verb to transfer data
with frozen_mutation, when all the STREAM_MUTATIONs are sent, it will
send STREAM_MUTATION_DONE to tell the peer the stream_transfer_task is
completed and remove the stream_transfer_task from _transfers map. The
peer will remove the corresponding stream_receive_task in _receivers.
We do not really need the COMPLETE_MESSAGE verb to notify the peer we
have completed sending. It makes the session completion logic much
simpler and cleaner if we do not depend on COMPLETE_MESSAGE verb.
However, to be compatible with older version, we always send a
COMPLETE_MESSAGE message and do nothing in the COMPLETE_MESSAGE handler
and replies a ready future even if the stream_session is closed already.
This way, node with older version will get a COMPLETE_MESSAGE message
and manage to send a COMPLETE_MESSAGE message to new node as before.
Message-Id: <1458540564-34277-2-git-send-email-asias@scylladb.com>
On scylla_setup interactive mode we are using lsblk to list up candidate
block devices for RAID, and -p option is to print full device paths.
Since Ubuntu 14.04LTS version of lsblk doesn't supported this option, we
need to use non-full path name and complete paths before passes it to
scylla_raid_setup.
Fixes#1030
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458325411-9870-1-git-send-email-syuu@scylladb.com>
We did the clean up in idl/gossip_digest.idl.hh, but the patch to clean
up gms/application_state.hh was never merged.
To maintain compatibility with previous version of scylla, we can not
change application_state.hh, instead change idl to be sync with
application_state.hh.
Message-Id: <3a78b159d5cb60bc65b354d323d163ce8528b36d.1458557948.git.asias@scylladb.com>
* dist/ami/files/scylla-ami 84bcd0d...56f1ab7 (2):
> Ubuntu AMI support on scylla_install_ami
> scylla_ami_setup is not POSIX sh compatible, change shebang to /bin/bash
* seastar 6a207e1...c193821 (6):
> semaphore: allow wait() and signal() after broken()
> run reactor::stop() only once
> sharded: fix start with reference parameter
> core: add asserts to rwlock
> util/defer: Fix cancel() not being respected
> tcp: Do not return accept until the connection is connected
Since upstart does not have same behavior as systemd, we need to run scylla_io_setup and scylla_ami_setup in scylla-server.conf's pre-start stanza.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
"apt-get -y install mdadm" shows up a dialog to select install mode of postfix, this will block scylla-ami-setup.service forever since it is running as background task, we need to prevent it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
This introduces Ubuntu AMI.
Both CentOS AMI and Ubuntu AMI are need to build on same distribution, so build_ami.sh script automatically detect current distribution, and selects base AMI image.
Fixes#998
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
When NR_CPU >= 8, we disabled cpu0 for AMI on scylla_sysconfig_setup.
But scylla_io_setup doesn't know that, try to assign NR_CPU queues, then scylla fails to start because queues > cpus.
So on this fix scylla_io_setup checks sysconfig settings, if '--smp <n>' specified on SCYLLA_ARGS, use n to limit queue size.
Also, when instance type is not supported pre-configured parameters, we need to passes --cpuset parameters to iotune. Otherwise iotune will run on a different set of CPUs, which may have different performance characteristics.
Fixes#996, #1043, #1046
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458221762-10595-2-git-send-email-syuu@scylladb.com>
Fix the validation error message to look like this:
Scylla version 666.development-20160316.49af399 starting ...
WARN 2016-03-17 12:24:15,137 [shard 0] config - Option partitioner is not (yet) used.
WARN 2016-03-17 12:24:15,138 [shard 0] init - NOFILE rlimit too low (recommended setting 200000, minimum setting 10000; you may run out of file descriptors.
ERROR 2016-03-17 12:24:15,138 [shard 0] init - Bad configuration: invalid 'listen_address': eth0: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> > (Invalid argument)
Exiting on unhandled exception of type 'bad_configuration_error': std::exception
Instead of:
Exiting on unhandled exception of type 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >': Invalid argument
Fixes#1051.
Message-Id: <1458210329-4488-1-git-send-email-penberg@scylladb.com>
Large allocations test, unsurprisingly, allocates a lot of memory. Do
not leak it so that any tests that are going to be run afterwards have
still some memory left.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
_closed_occupancy will be used when a region is removed from its region
group, make sure that it is accurate.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Defer registering services to the API server until commitlog has been
replayed to ensure that nobody is able to trigger sstable operations via
'nodetool' before we are ready for them.
Message-Id: <1458116227-4671-1-git-send-email-penberg@scylladb.com>
For this verb(), we don't call get_session - and it doesn't look like we will.
We currently have no debug message for this one, which makes it harder to debug
the stream of messages. Print it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Whenever we call get_session, that will print a debug message about the arrival
of this new verb. Because we also print that explicitly in PREPARE_DONE, that
message gets duplicated.
That confuses poor developers who are, for a while, left wondering why is it that
the sender is sender the message twice.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Our sstables::mutation_reader has a specialization in which start and end
ranges are passed as futures. That is needed because we may have to read the
index file for those.
This works well under the assumption that every time a mutation_reader will be
created it will be used, since whoever is using it will surely keep the state
of the reader alive.
However, that assumption is no longer true - for a while. We use a reader
interface for reading everything from mutations and sstables to cache entries,
and when we create an sstable mutation_reader, that does not mean we'll use it.
In fact we won't, if the read can be serviced first by a higher level entity.
If that happens to be the case, the reader will be destructed. However, since
it may take more time than that for the start and end futures to resolve, by
the time they are resolved the state of the mutation reader will no longer be
valid.
The proposed fix for that is to only resolve the future inside
mutation_reader's read() function. If that function is called, we can have a
reasonable expectation that the caller object is being kept alive.
A second way to fix this would be to force the mutation reader to be kept alive
by transforming it into a shared pointer and acquiring a reference to itself.
However, because the reader may turn out not to be used, the delayed read
actually has the advantage of not even reading anything from the disk if there
is no need for it.
Also, because sstables can be compacted, we can't guarantee that the sst object
itself , used in the resolution of start and end can be alive and that has the
same problem. If we delay the calling of those, we will also solve a similar
problem. We assume here that the outter reader is keeping the SSTable object
alive.
I must note that I have not reproduced this problem. What goes above is the
result of the analysis we have made in #1036. That being the case, a thorough
review is appreciated.
Fixes#1036
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a7e4e722f76774d0b1f263d86c973061fb7fe2f2.1458135770.git.glauber@scylladb.com>
Asking to read from byte 100 when a file has 50 bytes is an obvious error.
But what if we ask to read from byte 50? What if we ask to read 0 bytes at
byte 50? :-)
Before this patch, code which asked to read from the EOF position would
get an exception. After this patch, it would simply read nothing, without
error. This allows, for example, reading 0 bytes from position 0 on a file
with 0 bytes, which apparently happened in issue #1039...
A read which starts at a position higher than the EOF position still
generates an exception.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458137867-10998-1-git-send-email-nyh@scylladb.com>
The uncompression code reads the compressed chunks containing the bytes
pos through pos + len - 1. This, however, is not correct when len==0,
and pos + len - 1 may even be -1, causing an out-of-range exception when
calling locate() to find the chunks containing this byte position.
So we need to treat len==0 specially, and in this case we don't read
anything, and don't need to locate() the chunks to read.
Refs #1039.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458135987-10200-1-git-send-email-nyh@scylladb.com>
This fixes gossip test shutdown similar to what commit 13ce48e ("tests:
Fix stop of storage_service in cql_test_env") did for CQL tests:
gossip_test: /home/penberg/scylla/seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = net::messaging_service]: Assertion `local_is_initialized()' failed.
Running 1 test case...
[snip]
unknown location(0): fatal error in "test_boot_shutdown": signal: SIGABRT (application abort requested)
seastar/tests/test-utils.cc(32): last checkpoint
Message-Id: <1458126520-20025-1-git-send-email-penberg@scylladb.com>
The cf can be deleted after the cf deletion check. Handle this case as
well.
Use "warn" level to log if cf is missing. Although we can handle the
case, but it is good to distingush where the receiver of streaming
applied all the stream mutations or not. We believe that the cf is
missing because it was dropped, but it could be missing because of a bug
or something we didn't anticipated here.
Related patch: "streaming: Handle cf is deleted when sending
STREAM_MUTATION_DONE"
Fixes simple_add_new_node_while_schema_changes_test failure.
Message-Id: <c4497e0500f50e0a3422efb37e73130765c88c57.1458090598.git.asias@scylladb.com>
It is a legacy API from c*. Since we can wait for the
update_pending_ranges to complete, we can wait for it directly instead
of calling block_until_update_pending_ranges_finished to do so.
Also, change do_update_pending_ranges to be private.
Message-Id: <ac79b2879ec08fdcd3b2278ff68962cc71492f12.1458040608.git.asias@scylladb.com>
The verb is just for reporting and debugging purposes, but it is better
not to register it until it can return a meaningful value. Besides, it
really belongs to the migration manager subsystem anyway.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>
Streaming is used by bootstrap and repair. Streaming uses storage_proxy
class to apply the frozen_mutation and db/column_family class to
invalidate row cache. Defer the initalization just before repair and
bootstrap init.
Message-Id: <8e99cf443239dd8e17e6b6284dab171f7a12365c.1458034320.git.asias@scylladb.com>
Register the REPAIR_CHECKSUM_RANGE messaging service verb handler after
we have replayed the commitlog to avoid responding with bogus checksums.
Message-Id: <1458027934-8546-1-git-send-email-penberg@scylladb.com>
"At the momment, the migration_listener callbacks returns void, it is impossible
to wait for the callbacks to complete. Make the callbacks runs inside seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the callbacks
return future<>.
Fixes #1000."
If a keyspace is created after we calcuate the pending ranges during
bootstrap. We will ignore the keyspace in pending ranges when handling
write request for that keyspace which will casue data lose if rf = 1.
Fixes#1000
At the momment, the callbacks returns void, it is impossible to wait for
the callbacks to complete. Make the callbacks runs inside seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the
callbacks return future<>.
Some network equipment that does TCP session tracking tend to drop TCP
sessions after a period of inactivity. Use keepalive mechanism to
prevent this from happening for our inter-node communication.
Message-Id: <20160314173344.GI31837@scylladb.com>
* seastar 88cc232...0739576 (4):
> rpc: allow configuring keepalive for rpc client
> net: add keepalive configuration to socket interface
> iotune: refuse to run if there is not enough space available
> rpc: make client connection error more clear
Defer registering migration manager RPC verbs after commitlog has has
been replayed so that our own schema is fully loaded before other other
nodes start querying it or sending schema updates.
Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>
The same shard may create an sstables::sstable object for the same SStable
that doesn't belong to it more than once and mark it
for deletion (e.g. in a 'nodetool refresh' flow).
In that case the destructor of sstables::sstable accounted
the deletion requests from the same shard more than once since it was a simple
counter incremented each time there was a deletion request while it should
account request from the same shard as a single request. This is because
the removal logic waited for all shards to agree on a removal of a specific
SStable by comparing the counter mentioned above to the total
number of shards and once they were equal the SStable files were actually removed.
This patch fixes this by replacing the counter by an std::unordered_set<unsigned>
that will store a shard ids of the shards requesting the deletion
of the sstable object and will compare the size() of this set
to smp::count in order to decide whether to actually delete the corresponding
SStable files.
Fixes#1004
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>
When we are about to write a new sstable, we check if the sstable exists
by checking if respective TOC exists. That check was added to handle a
possible attempt to write a new sstable with a generation being used.
Gleb was worried that a TOC could appear after the check, and that's indeed
possible if there is an ongoing sstable write that uses the same generation
(running in parallel).
If TOC appear after the check, we would again crap an existing sstable with
a temporary, and user wouldn't be to boot scylla anymore without manual
intervention.
Then Nadav proposed the following solution:
"We could do this by the following variant of Raphael's idea:
1. create .txt.tmp unconditionally, as before the commit 031bf57c1
(if we can't create it, fail).
2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp
we just created and fail.
3. continue as usual
4. and at the end, as before, rename .txt.tmp to .txt.
The key to solving the race is step 1: Since we created .txt.tmp in step 1
and know this creation succeeded, we know that we cannot be running in
parallel with another writer - because such a writer too would have tried to
create the same file, and kept it existing until the very last step of its
work (step 4)."
This patch implements the solution described above.
Let me also say that the race is theoretical and scylla wasn't affected by
it so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>
Since calculate_pending_ranges will modify token_metadata, we need to
replicate to other shards. With this patch, when we call
calculate_pending_ranges, token_metadata will be replciated to other
non-zero shards.
In addition, it is not useful as a standalone class. We can merge it
into the storage_service. Kill one singleton class.
Fixes#1033
Refs #962
Message-Id: <fb5b26311cafa4d315eb9e72d823c5ade2ab4bda.1457943074.git.asias@scylladb.com>
"This series adds more information (i.e. keys and tombstones) to the
query result digest in order to ensure correctness and increase the
chances of early detection of disagreement between replicas.
The digest is no longer computed by hashing query::result but build
using the query result builder. That is necessary since the query
result itself doesn't contain all information required to compute
the digest. Another consequence of this is that now replicas asked
for a result need to send both the result and the digest to
the coordinator as it won't be able to compute the digest itself.
Unfortunately, these patches change our on wire communication:
1) hash computation is different
2) format of query::result is changed (and it is made non-final)
Fixes #182."
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.
Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Result digest is going to be computed in query result builder and
require information not available in the query resylt. That's why the
digest now needs to be sent to the other nodes together with the result
as they won't be able compute it on their own.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently, if there is a disagreement between replicas we get mutations
from all of them, merge this mutations and send the result to the
client, difference between the result and the mutation sent by a
particular replica is sent back to repair it.
Unfortunately, that may not suffice to provide user with correct results
in case of disagreements.
Consider the following scenario:
create table cf(p int, c int, r int, primary key(p, c));
node1:
p=0, c=1, r=1 (timestamp = 1)
p=0, c=2, r=2 (timestamp = 2)
node2:
p=0, c=1, r=tombstone (timestamp = 2)
p=0, c=2, r=1 (timestamp = 1)
query:
select r from cf limit 1;
Let's assume there are no row markers. node1 will send only outdated
cell (p=0, c=1, r=1) while node2 will send both tombstone for c=1 and
outdated cell (p=0, c=2, r=1). A disagreement will be detected, the
replies will be merged and the coordinator will respond to the client
with result r=1, while the correct answer is r=2.
The solution proposed in this patch is to attempt to detect cases when
the problem may occur and retry queries with larger limit which result
in replicas providing more information.
The detection logic is simple: the partition key and clustering key of
the last row in the reconciled result are compared with the partition
keys and clustering keys of the last rows of replies from replicas
(except short reads). If the (pk, ck) of the replica last row is smaller
than the (pk, ck) of the reconciled result the query is retried.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Fix bootstrap_test.py:TestBootstrap.failed_bootstap_wiped_node_can_join_test
Logs on node 1:
INFO 2016-03-11 15:53:43,287 [shard 0] gossip - FatClient 127.0.0.2 has been silent for 30000ms, removing from gossip
INFO 2016-03-11 15:53:43,287 [shard 0] stream_session - stream_manager: Close all stream_session with peer = 127.0.0.2 in on_remove
WARN 2016-03-11 15:53:43,498 [shard 0] stream_session - [Stream #4e411ba0-e75e-11e5-81f8-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION_DONE to 127.0.0.2:0: std::runtime_error ([Stream #4e411ba0-e75e-11e5-81f8-000000000000] GOT STREAM_ MUTATION_DONE 127.0.0.1: Can not find stream_manager)
terminate called without an active exception
Backtrace on node 1:
#0 0x00007fb74723da98 in raise () from /lib64/libc.so.6
#1 0x00007fb74723f69a in abort () from /lib64/libc.so.6
#2 0x00007fb74ab84aed in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007fb74ab82936 in ?? () from /lib64/libstdc++.so.6
#4 0x00007fb74ab82981 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fb74ab82be9 in __cxa_rethrow () from /lib64/libstdc++.so.6
#6 0x0000000000f3521e in streaming::stream_transfer_task::<lambda()>::<lambda(auto:44)>::operator()<std::__exception_ptr::exception_ptr> (ep=..., __closure=0x7ffce74d8630) at streaming/stream_transfer_task.cc:169
#7 do_void_futurize_apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1142
#8 futurize<void>::apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1190
#9 future<>::<lambda(auto:7&&)>::operator()<future<> > ( fut=fut@entry=<unknown type in /home/asias/src/cloudius-systems/scylla/build/release/scylla, CU 0xec84d00, DIE 0xee2561d>, __closure=__closure@entry=0x7ffce74d8630) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1014
Message-Id: <1457684884-4776-2-git-send-email-asias@scylladb.com>
Currently, if sstable::write_components() is called to write a new sstable
using the same generation of a sstable that exists, a temporary TOC will
be unconditionally created. Afterwards, the same sstable::write_components()
will fail when it reaches sstable::create_data(). The reason is obvious
because data component exists for that generation (in this scenario).
After that, user will not be able to boot scylla anymore because there is
a generation with both a TOC and a temporary TOC. We cannot simply remove a
generation with TOC and temporary TOC because user data will be lost (again,
in this scenario). After all, the temporary TOC was only created because
sstable::write_components() was wrongly called with the generation of a
sstable that exists.
Solution proposed by this patch is to trigger exception if a TOC file
exists for the generation used.
Some SSTable unit tests were also changed to guarantee that we don't try
to overwrite components of an existing sstable.
Refs #1014.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>
Since commit 2f56577 ("sstables: more efficient read of compressed data
file"), the compressed_file_input_stream uses a file_input_stream to
efficiently read the compressed data at chunks some desired size (128 KB
is our default) instead of at smaller compressed chunks.
However, I had a bug where I mis-calculated the desired length of the
read (giving the *end byte* instead of the length!) and as a result
file_input_stream did not know where the read was supposed to stop, and
always read 128 KB buffers. The results were not incorrect, because the
sstable reader stops when it needs to, even if given too much data. But
it was inefficient because too much data was read in the last buffer.
With this patch, the length is correctly given to the input stream, and
it can read a much smaller buffer at the end of the read, not the full
128 KB. I tested that this actually happens.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457633616-15193-1-git-send-email-nyh@scylladb.com>
"As described in issue #1014, we have found ourselves in a situation where
SSTables can be written too early, and that causes problems for the existing
SSTables. While this shouldn't happen - and Pekka's recent patch to move
populate() a lot earlier in initialization should fix that, when that did
happen what we had was not enough to prevent it from overwriting existing
tables.
We should do a lot better job protecting against that.
Also, some of the exceptions that are generated at totally inconclusive. This
series also aims at making some of the exceptions more descriptive."
This patch makes sure that every time we need to create a new generation number -
the very first step in the creation of a new SSTable, the respective CF is already
initialized and populated. Failure to do so can lead to data being overwritten.
Extensive details about why this is important can be found
in Scylla's Github Issue #1014
Nothing should be writing to SSTables before we have the chance to populate the
existing SSTables and calculate what should the next generation number be.
However, if that happens, we want to protect against it in a way that does not
involve overwriting existing tables. This is one of the ways to do it: every
column family starts in an unwriteable state, and when it can finally be written
to, we mark it as writeable.
Note that this *cannot* be a part of add_column_family. That adds a column family
to a db in memory only, and if anybody is about to write to a CF, that was most
likely already called. We need to call this explicitly when we are sure we're ready
to issue disk operations safely.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The standard C++ exception messages that will be thrown if there is anything
wrong writing the file, are suboptimal: they barely tell us the name of the failing
file.
Use a specialized create function so that we can capture that better.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Deletion of previous stale, temporary SSTables is done by Shard0. Therefore,
let's run Shard0 first. Technically, we could just have all shards agree on the
deletion and just delete it later, but that is prone to races.
Those races are not supposed to happen during normal operation, but if we have
bugs, they can. Scylla's Github Issue #1014 is an example of a situation where
that can happen, making existing problems worse. So running a single shard
first and getting making sure that all temporary tables are deleted provides
extra protection against such situations.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are no longer using the in_flight_seals gate, but forgot to remove it.
To guarantee that all seal operations will have finished when we're done,
we are using the memtable_flush_queue, which also guarantees order. But
that gate was never removed.
The FIXME code should also be removed, since such interface does exist now.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We already have a function that wraps this, re-use it. This FIXME is still
relevant, so just move it there. Let's not lose it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We use memory usage as a threshold these days, and nowhere is _mutation_count
checked. Get rid of it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
While looking at initialization code I felt like my head is going to
explode. Moving initialization into a thread makes things a little bit
better. Only lightly tested.
Message-Id: <20160310163142.GE28529@scylladb.com>
Make sure http_response.hh that is pulled by locator/ec2_snitch.hh is
built. The commit is similar to what commit 6ccf8f8 ("build: make sure
to ask seastar to build http/request_parser.hh, and depend on it") did
for request_parser.hh.
Fixes the following build error on CentOS:
In file included from ./locator/ec2_multi_region_snitch.hh:41:0,
from locator/ec2_multi_region_snitch.cc:39:
./locator/ec2_snitch.hh:24:40: fatal error: http/http_response_parser.hh: No such file or directory
Spotted by Shlomi.
Message-Id: <1457612266-315-1-git-send-email-penberg@scylladb.com>
We start services like gossiper before system keyspace is initialized
which means we can start writing too early. Shuffle code so that system
keyspace is initialized earlier.
Refs #1014
Message-Id: <1457593758-9444-1-git-send-email-penberg@scylladb.com>
If we do
- Decommission a node
- Stop a node
we will shutdown gossip more than once in:
- storage_service::decommission
- storage_service::drain_on_shutdown
Fix by checking if it is already stopped and back off if so.
If we do
- Decommission a node
- Stop a node
we will shutdown messaging_service more than once in:
- storage_service::decommission
- storage_service::drain_on_shutdown
Fixes#1005
Refs #1013
This fix a dtest failure in debug build.
update_cluster_layout_tests.TestUpdateClusterLayout.simple_decommission_node_1_test/
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802:35:
runtime error: member call on null pointer of type 'struct
future_state'
core/future.hh:334:49: runtime error: member access within null
pointer of type 'const struct future_state'
ASAN:SIGSEGV
=================================================================
==4557==ERROR: AddressSanitizer: SEGV on unknown address
0x000000000000 (pc 0x00000065923e bp 0x7fbf6ffac430 sp 0x7fbf6ffac420
T0)
#0 0x65923d in future_state<>::available() const
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:334
#1 0x41458f1 in future<>::available()
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802
#2 0x41458f1 in then_wrapped<parallel_for_each(Iterator, Iterator,
Func&&)::<lambda(parallel_for_each_state&)> [with Iterator =
std::__detail::_Node_iterator<std::pair<const net::msg_addr,
net::messaging_service::shard_info>, false, true>; Func =
net::messaging_service::stop()::<lambda(auto:39&)> [with auto:39 =
std::unordered_map<net::msg_addr, net::messaging_service::shard_info,
net::msg_addr::hash>]::<lambda(std::pair<const net::msg_addr,
net::messaging_service::shard_info>&)>]::<lambda(future<>)>, future<>
> /data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:878
Attempt to print std::nested_exception currently results in exception
to leak outside the printer. Fix by capturing all exception in the
final catch block.
For nested exception, the logger will print now just
"std::nested_exception". For nested exceptions specifically we should
log more, but that is a separate problem to solve.
Message-Id: <1457532215-7498-1-git-send-email-tgrabiec@scylladb.com>
Now, get_ranges_for_endpoint will unwrap the first range. With t0 t1 t2
t3, the first range (t3,t0] will be splitted as (min,t0] and (t3,max].
Skippping the range (t3,max] we will get the correct ownership number as
if the first range were not splitted.
Fixes#928
Message-Id: <2e30ebd53f3dba3cc5e0cf36d5541c354b0e30ca.1457506704.git.asias@scylladb.com>
In the preparation phase of streaming, we check that remote node has all
the cf_id which are needed for the entire streaming process, including the
cf_id which local node will send to remote node and wise versa.
So, at later time, if the cf_id is missing, it must be that the cf_id is
deleted. It is fine to ingore no_such_column_family exception. In this
patch, we change the code to ignore at server side to avoid sending the
exception back, to avoid handle exception in an IDL compatiable way.
One thing we can improve is that the sender might know the cf is deleted
later than the receiver does. In this case, the sender will send some
more mutations if we send back the no_such_column_family back to the
sender. However, since we do not throw exceptions in the receiver stream
mutation handler, it will not cause a lot of overhead, the receiver will
just ignore the mutation received.
Fixes#979
It is possible that a cf is deleted after we make the cf reader. Avoid
sending them to avoid the unnecessary overhead to send them on the wire and
the peer node to drop the received mutations.
"Hook streaming with gossip callback so we can abort
the stream_session in such case:
- a node is restarted
- a node is removed from the cluster
Fixes #1001."
Before this patch, reading large ranges from a compressed data file involved
two inefficiencies:
1. The compressed data file was read one compressed chunk at a time.
Such a chunk is around 30 KB in size, well below our desired sstable
read-ahead size (sstable_buffer_size = 128 KB).
2. Because the compressed chunks have variable length (the uncompressed
chunk has a fixed length) they are not aligned to disk blocks, so
consecutive chunks have overlapping blocks which were unnecessarily
read twice.
The fix for both issues is to build the compressed_file_input_stream on
an existing file_input_stream, instead of using direct file IO to read the
individual chunks. file_input_stream takes care of doing the appropriate
amount of read-ahead, and the compressed_file_input_stream layer does the
decompression of the data read from the underlying layer.
Fixes#992.
Historical note: Implementing compressed_file_input_stream on top of
file_input_stream was already tried in the past, and rejected. The problem
at that time was that compressed_file_input_stream's constructor did not
specify the *end* of the range to read, so that when we wanted to read
only a small range we got too much read-ahead beyond the exactly one
compressed chunk that we needed to read. Following the fix to issue #964,
we now know on every streaming read also the intended *end* of the stream,
so we can now use this to stop reading at the end of the last required
chunk, even when we use a read-ahead buffer much larger than a chunk.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457304335-8507-1-git-send-email-nyh@scylladb.com>
We try to be robust against files disappearing (due to any kind of corruption)
inside the data directory.
But if the data directory itself goes missing, that's a situation that we don't
handle correctly. We will keep accepting writes normally, but when we try to
flush the memtable to disk, we'll fail with a system error.
Having the CF directory disappearing is not a common thing. But it is also one
that we can easily protect against, by touching all CF directories we know
about on startup.
Fixes#999
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ed66373dccca11742150a6d08e21ece3980227d3.1457379853.git.glauber@scylladb.com>
- Start a node
- Inject data
- Start another node to bootstrap
- Before the second node finishes streaming, kill the second node
- After a while the node will be removed from the cluster becusue it does
not manage to join the cluster.
- At this time, messaging_service might keep retrying the
stream_mutations unncessarily.
To fix, check if the peer node is still a known node in the gossip.
If the peer node of a stream_session is restarted or removed we should
abort the streaming. It is better to hook gossip callback in the stream
manager than in each streamm_session.
In due time we will have to fix this, but as an interim step, let's use
a "better" magic number.
The problem with 100, is that as soon as the partitions start to go bigger,
we're using too much memory. Since this is multiplied by the number of token
ranges, and happens in every shard, the final number can become really big,
and the amount of resources we use go up proportionally.
This means that even we are mistaken about the new number (we probably are),
in this case it is better to err on the side of a more conservative resource
usage.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <97158f3db5734916cee4ccf12eaa66e7402570bb.1457448855.git.glauber@scylladb.com>
When we do a streaming read that knows the expected *end* position of the
read, we can use a large read-ahead buffer, and at the same time, stop
reading at exactly the intended end (or small rounding of it to the DMA
block size) and not waste resources blindly reading a large amount of data
after the end just to fill the read-ahead buffer.
The sstable reading code, both for reading the data file and the index file,
created a file input stream without specifiying its end, thereby losing
this optimization - so when a large buffer was used, we would get a large
over-read. This patch fixes this, so sstable data file and index file are
read using a file input stream which is a ware of its end.
Fixes#964.
Note that this patch does not change the behavior when reading a
*compressed* data file. For compressed read, we did not have the problem
of over-read in the first place, because chunks are read one by one.
But we do have other sources of inefficiencies there (stemming, again,
from the fact that the compressed chunks are read one by one), and I
opened a separate issue #992 for that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457219304-12680-1-git-send-email-nyh@scylladb.com>
Fixes#967
Frozen lists are just atomic cells. However, old code inserted the
frozen data directly as an atomic_cell_or_collection, which in turn
meant it lacked the header data of a cell. When in turn it was
handled by internal serialization (freeze), since the schema said
is was not a (non-frozen) collection, we tried to look at frozen
list data as cell header -> most likely considered dead.
Message-Id: <1457432538-28836-1-git-send-email-calle@scylladb.com>
background_reads collectd counter was not always properly decremented.
Fix it and streamline background read repair error handling.
Message-Id: <20160307182255.GI4849@scylladb.com>
Currently it is waited upon only if background read repair check is
needed and this cause unhandled exception warning to be printed if
it enters failed state. Fix this by always waiting on it, but doing
anything beyond ignoring an exception only if check is needed.
Message-Id: <1457351304-28721-1-git-send-email-gleb@scylladb.com>
Currently write acknowledgements handling does not take bootstrapping
node into account for CL=EACH_QUORUM. The patch fixes it.
Fixes#994
Message-Id: <20160307121620.GR2253@scylladb.com>
During bootstrapping additional copies of data has to be made to ensure
that CL level is met (see CASSANDRA-833 for details). Our code does
that, but it does not take into account that bootstraping node can be
dead which may cause request to proceed even though there is no
enough live nodes for it to be completed. In such a case request neither
completes nor timeouts, so it appear to be stuck from CQL layer POV. The
patch fixes this by taking into account pending nodes while checking
that there are enough sufficient live nodes for operation to proceed.
Fixes#965
Message-Id: <20160303165250.GG2253@scylladb.com>
From Vlad:
This series modifies the 'database' class to use the internal
_enable_incremental_backups value (initialized with
'incremental_backups' configuration value) instead of using the
'incremental_backups' configuration value directly.
Then we update this internal value in runtime from 'nodetool
enable/disablebackup' API callback so that newly created keyspaces and
column families use the newly configured incremental backup
configuration.
From Vlad:
This series fixes the first part of issue #909 (the second part has a
separate github issue #965) which is a discrepancy between a
storage_service::token_metadata and a gossiper::endpoint_state_map
contents on non-zero shards.
In region destructor, after active segments is freed pointer to it is
left unchanged. This confuses the remaining parts of the destructor
logic (namely, removal from region group) which may rely on the
information in region_impl::_active.
In this particular case the problem was that code removing from the
region group called region_impl::occupancy() which was
dereferencing _active if not null.
Fixes#993.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1457341670-18266-1-git-send-email-pdziepak@scylladb.com>
'nodetool enable/disablebackup' callback was modifying only the
existing keyspaces and column families configurations.
However new keyspaces/column families were using
the original 'incremental_backups' configuration value which could
be different from the value configured by 'nodetool enable/disablebackup'
user command.
This patch updates the database::_enable_incremental_backups per-shard
value in addition to updating the existing keyspaces and column families
configurations.
Fixes#845
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Store the "incremental_backups" configuration value in the database
class (and use it when creating a keyspace::config) in order to be
able to modify it in runtime.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
If storage_service::token_metadata is not distributed together with
gossiper::endpoint_state_map there may be a situation when a non-zero
shard sees a new value in token_metadata (e.g. newly added node's
token ranges) while still seeing an old gossiper::endpoint_state_map
contents (e.g. a mentioned above newly added node may not be present,
thus causing gossiper::is_alive() to return FALSE for that node, while
the node is actually alive and kicking).
To avoid this discrepancy we will always update a token_metadata together
with an endpoint_state_map when we distribute new token_metadata data
among shards.
Fixes#909
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
We will need to access it from a storage_service class when replicate
token_metadata.
Rename _shadow_endpoint_state_map -> shadow_endpoint_state_map
according to our coding convention.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
If timeout happens after cl promise is fulfilled, but before
continuation runs it removes all the data that cl continuation needs
to calculate result. Fix this by calculating result immediately and
returning it in cl promise instead of delaying this work until
continuation runs. This has a nice side effect of simplifying digest
mismatch handling and making it exception free.
Fixes#977.
Message-Id: <1457015870-2106-3-git-send-email-gleb@scylladb.com>
Read executor may ask for more than one data reply during digest
resolving stage, but only one result is actually needed to satisfy
a query, so no need to store all of them.
Message-Id: <1457015870-2106-2-git-send-email-gleb@scylladb.com>
In digest resolver for cl to be achieved it is not enough to get correct
number of replies, but also to have data reply among them. The condition
in digest timeout does not check that, fortunately we have a variable
that we set to true when cl is achieved, so use it instead.
Message-Id: <1457015870-2106-1-git-send-email-gleb@scylladb.com>
Ubuntu 14.04LTS package is broken now because iotune does not statically linked against libstdc++, so this patch fixed it.
Requires seastar patch to add --static-stdc++ on configure.py.
Fixes#982
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456995050-22007-1-git-send-email-syuu@scylladb.com>
1) As explained in commit 697b16414a (gossip: Make gossip message
handling async), in each gossip round we can make talking to the 1-3
peer nodes in parallel to reduce latency of gossip round.
2) Gossip syn message uses one way rpc message, but now the returned
future of the one way message is ready only when message is dequeued for
some reason (sent or dropped). If we wait for the one way syn messge to
return it might block the gossip round for a unbounded time. To fix, do
not wait for it in the gossip round. The downside is there will be no
back pressure to bound the syn messages, however since the messages are
once per second, I think it is fine.
Message-Id: <ea4655f121213702b3f58185378bb8899e422dd1.1456991561.git.asias@scylladb.com>
Currently schema changes are only logged at coordinator node which
initiates the change. It would be helpful in post morten analysis to
also see when and how schema changes are resolved when applied on
other nodes.
Message-Id: <1456953095-1982-1-git-send-email-tgrabiec@scylladb.com>
Use the existing "feed_hash" mechanism to find a checksum of the
content of a mutation, instead of serializing the mutation (with freeze())
and then finding the checksum of that string.
The serialized form is more prone to future changes, and not really
guaranteed to provide equal hashes for mutations which are considered
"equal".
Fixes#971
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1456958676-27121-1-git-send-email-nyh@scylladb.com>
While is is formally better to take a local lock first and
then first contend for a global, in this case it is arguably
better to ensure we get a gate exception synchronously (early)
instead of potentially in a continuation. Old version might
cause us to do a gate::leave even while never entered.
And since we should really only have one active (contending)
segment per shard anyway, it should not matter.
Message-Id: <1456931988-5876-1-git-send-email-calle@scylladb.com>
Fixes#865
(Some) gcc 5 (5.3.0 for me) on ubuntu will generate errors on
compilation of this code (compiling logalloc_test). The memcpy
to inline storage seems to confuse the compiler.
Simply change to std::copy, which shuts the compiler up.
Any decent stl should convert primitive std::copy to memcpy
anyway, but since it is also the inline (small storage),
it should not matter which way.
Message-Id: <1456931988-5876-4-git-send-email-calle@scylladb.com>
"This series implements describe_schema_versions so that we nodetool
describecluster can return proper schema information for the whole
cluster. It involves adding new verb SCHEMA_CHECK which is used to get
schema version for a given node and a simple map-reduce that using that
verb gets info from the whole cluster.
This fixes#677, fixes#684, and fixes #472."
Useful for determining order of events in logs of different nodes, or
for estimating how much time passed between two events.
Fixes#941.
Example log:
INFO 2016-03-01 18:30:37,688 [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
INFO 2016-03-01 18:30:45,689 [shard 0] gossip - No gossip backlog; proceeding
INFO 2016-03-01 18:30:45,689 [shard 0] storage_service - Starting listening for CQL clients on localhost:9042...
Message-Id: <1456853532-28800-1-git-send-email-tgrabiec@scylladb.com>
Start with coarse control:
1) converting the run_with_write_api_lock operations:
join_ring, start_gossiping, stop_gossiping, start_rpc_server,
stop_rpc_server, start_native_transport, stop_native_transport,
decommission, remove_node, drain, move, rebuild
to use run_with_api_lock which uses a flag to indicate current operation
in progress.
If one of the above operation is in progress when admin issues another
opeartion we return a "try again" exception to avoid running two
operations in parallel.
2) converting the run_with_read_api_lock to use no lock.
Fixes#850.
Message-Id: <00782b601028ed87437e5decae382f72dff634f6.1456758391.git.asias@scylladb.com>
Currently when reading a view to an object the stream stored has the
same bound as the containing stream, not the bounds of the object
itself. The serializer of the view assumes that the stream has the
bounds of the object itself.
Fixes dtest failure in
paging_test.py:TestPagingSize.test_undefined_page_size_default
Fixes#963.
Message-Id: <1456854556-32088-1-git-send-email-tgrabiec@scylladb.com>
The segment->segment_manager pointer has, until now, been a raw pointer,
which in a way is sensible, since making circular shared pointer
relations is in general bad. However, since the code and life cycle
of segments has evolved quite a bit since that initial relation
was defined, becoming both more and then suddenly, in a sense,
less, asynchronous over time, the usage of the relation is in fact
more consistent with a shared pointer, in that a segment needs to
access its manager to properly do things like write and flush.
These two ops in particular depend on accessing the segment manager
in a way that might be fine even using raw pointers, if it was not
again for that little annoying thing of continuation reordering.
So, lets just make the relation a shared pointer, solving the issue
of whether the manager is alive when a segment accesses it. If it
has been "released" (shut down), the existing mechanisms (gate)
will then trigger and prevent any actual _actions_ from taking
place. And we don't have to complicate anything else even more.
Only "big" change is that we need to explicitly orphan all
segments in commitlog destructor (segment_manager is essentially
a p-impl).
This fixes some spurious crashes in nightly unit tests.
Fixes#966.
Message-Id: <1456838735-17108-1-git-send-email-calle@scylladb.com>
We cannot use shared_ptr *instances* for checking duplicate column
definitions because they are never equal. Store column definition name
in the unordered_map instead.
Fixes cql_additional_tests.py:TestCQL.identifier_test.
Spotted by Shlomi.
Message-Id: <1456840506-13941-1-git-send-email-penberg@scylladb.com>
When the first time the keep alive timer fires, the _last_stream_bytes
btyes will be zero since it is the first time we update it. The keep
alive timer will be rearmed and fired again. The second time, we find
there is no progress, we close the session. The total idle time will be
2 * keep alive timer.
To make the idle time to close the session be more precise, we reduce
the interval to check the progess and close the session by checking last
time the progress is made.
Message-Id: <c959cffce0cc738a3d73caaf71d2adb709d46863.1456831616.git.asias@scylladb.com>
Checking schema::is_dense() is not enough to know whether row marker
should be inserted or not as there may be compact storage tables that
are not considered dense (namely, a table with now clustering key).
Row marker should only be insterted if schema::is_cql3_table() is true.
Fixes#931.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456834937-1630-1-git-send-email-pdziepak@scylladb.com>
corrupt_segment() is meant to write some garbage at arbitrary position
in the commitlog segment. That position is not necessairly properly
aligned for uint32_t.
Silences ubsan complaints about unaligned write.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456827726-21288-1-git-send-email-pdziepak@scylladb.com>
Unlike CentOS/Fedora, scylla_io_setup is calling from pre-start section of scylla-server upstart job, not from separated job.
This is because Upstart does not provide same behavior as After / Requires directives on systemd.
Fixes#954.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456825805-4195-1-git-send-email-syuu@scylladb.com>
This patch change the way optional vector are implemented.
Now a vector of optional would be handle like any other non primitive
types, with a single method add() that would return a writer to the
optional.
The writer to the optional would have a skip and write method like
simple optional field.
For basic types the write method would get the value as a parameter, for
composite type, it would return a writer to the type.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456796143-3366-2-git-send-email-amnon@scylladb.com>
To report disk usage, scylla was only taking into account size of
sstable data component. Other components such as index and filter
may be relatively big too. Therefore, 'nodetool status' would
report an innacurate disk usage. That can be fixed by taking into
account size of all sstable components.
Fixes#943.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <08453585223570006ac4d25fe5fb909ad6c140a5.1456762244.git.raphaelsc@scylladb.com>
Fixes#482
See code comment. Reserve segment allocation count sum can temporarily
overflow due to continuation delay/reordering, if we manage to reach the
on_timer code before finally clauses from previous reserve allocation
invocation has processed. However, since these are benign overflows
(just indicating even more that we don't need to do anything right now)
simply capping the count should be fine.
Avoids assert in boost irange.
Message-Id: <1456740679-4537-1-git-send-email-calle@scylladb.com>
"This patchset fixes#950, run scylla-io-setup before scylla-server on anycase, and installs example /etc/scylla.d/io.conf by default to prevent error on 'EnvironmentFile=/etc/scylla.d/*.conf'."
With this change, you can define your own prefix of AMI name in variable.json.
example:
{
"access_key": "xxx",
"secret_key": "xxx",
"subnet_id": "xxx",
"security_group_id": "xxx",
"region": "us-east-1",
"associate_public_ip_address": "true",
"instance_type": "c4.xlarge",
"ami_prefix": "takuya-"
}
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456329247-5109-1-git-send-email-syuu@scylladb.com>
Prevent error on 'EnvironmentFile=/etc/scylla.d/*.conf'.
Parameters are commented out, and the file will replace when scylla starts, by scylla-io-setup.service.
"The series includes Amnon's unmerged support for optional<> in idl-compiler.
Depends on seastar patch "[PATCH seastar] simple_input_stream: Introduce begin()".
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.
perf_simple_query shows slight regression in throughput (-8%):
build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000
Before: ~433k tps
After: ~400k tps"
We require SSE 4.2 (for commitlog CRC32), verify it exists early and bail
out if it does not.
We need to check early, because the compiler may use newer instructions
in the generated code; the earlier we check, the lower the probability
we hit an undefined opcode exception.
Message-Id: <1456665401-18252-1-git-send-email-avi@scylladb.com>
Before:
ERROR [shard 0] storage_service - Format of host-id =
marshal_exception (marshalling error) is incorrect ???
Exiting on unhandled exception of type 'marshal_exception': marshalling error
After:
ERROR [shard 0] storage_service - Unable to parse 127.0.0.3 as host-id
Exiting on unhandled exception of type 'std::runtime_error': Unable to
parse 127.0.0.3 as host-id
Message-Id: <1456737987-32353-1-git-send-email-asias@scylladb.com>
It is used by
nodetool status
If an api operation inside storage_service takes a long time to finish
, which holds the lock, it will block nodetool status for a long time.
I think it is safe to get the load map even if other operations are in-flight.
Refs: #850
Message-Id: <1456737987-32353-2-git-send-email-asias@scylladb.com>
"This series:
1) Log total bytes sent/recevied when a stream plan completes.
It is useful in test code.
2) Fix http://scylla_ip:10000/stream_manager API"
Otherwise we will leak it, and region destructor will fail:
row_cache_test: utils/logalloc.cc:1211: virtual logalloc::region_impl::~region_impl(): Assertion `seg->is_empty()' failed.
Fixes regression in row_cache_test.
Since invalidate() may allocate, we need to take the region lock to
keep m.partitions references valid around whole clear_and_dispose(),
which relies on that.
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.
perf_simple_query shows slight regression in throughput (-8%):
build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000
Before: ~433k tps
After: ~400k tps
By initilizing them to 0 we can catch unclosed frames at
deserialization time. It's better than leaving frame size undefined,
which may cause errors much later in deserialization process and thus
would make it harder to identifiy the real cause.
This patch adds optional writer support an optional field can be either
skip or set.
For vector of optional, a write_empty method will
add 1 to the vector count and mark the optional as false.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
For each stream_session, we pretend we are sending/receiving one file,
to make it compatible with nodetool. For receiving_files, the file name
is "rxnofile". For sending_files, the file name is "txnofile".
stream_manager::update_all_progress_info is introduced to update the
progress info of all the stream_sessions in the node. We need this
because streaming mutations are received on all the cores, but the
stream_session object is only on one of the cores. It adds overhead if
we update progress info in stream_session object whenever we receive a
streaming mutation. So, what we do now is when we really need the
progress info, we update the progress info in stream_session object.
With http://127.0.0.$i:10000/stream_manager/, it looks like below when
decommission node 3 in a 3 nodes cluster.
=========== GET NODE 1
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.3", "current_bytes": 16876296}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]
=========== GET NODE 2
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16755552, "peer": "127.0.0.3", "current_bytes": 16755552}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]
=========== GET NODE 3
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"sending_files": [{"value": {"direction":
"OUT", "file_name": "txnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.1", "current_bytes": 16876296}, "key":
"txnofile"}], "sending_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.1", "peer":
"127.0.0.1"},{"sending_files": [{"value": {"direction": "OUT",
"file_name": "txnofile", "session_index": 0, "total_bytes": 16755552,
"peer": "127.0.0.2", "current_bytes": 16755552}, "key": "txnofile"}],
"sending_summaries": [{"files": 1, "total_size": 0, "cf_id":
"869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0, "state":
"PREPARING", "connecting": "127.0.0.2", "peer": "127.0.0.2"}]}]
The problem is that a generic functions (eg. skip()) which call
deserialize() overloads based on their template parameter only see
deserilize() overloads which were declared at the time skip() was
declared and not those which are available at the time of
instantiation. This forces all serializers to be declared before
serialization_visitors.hh is first included. Serializers included
later will fail to compile. This becomes problematic to ensure when
serializers are included from headers.
Template class specialization lookup doesn't suffer from this
limitation. We can use that to solve the problem. The IDL compiler
will now generate template class specializations with read/write
static methods. In addition to that, default serializer() and
deserialize() implementations are delegating to serializer<>
specialization so that API and existing code doesn't have to change.
Message-Id: <1456423066-6979-1-git-send-email-tgrabiec@scylladb.com>
"Gleb has recently noted that our query reads are not even being registered
with the I/O queue.
Investigating what is happening, I found out that while the priority that
make_reader receives was not being properly passed downwards to the SSTable
reader. The reader code is also used by compaction class, and that one is fine.
But the CQL reads are not.
On top of that, there are also some other places where the tag was not properly
propagated, and those are patched."
add_local_application_state is used in various places. Before this
patch, it can only be called on cpu zero. To make it safer to use, use
invoke_on() to foward the code to run on cpu zero, so that caller can
call it on any cpu.
Refs: #795
Message-Id: <d69b81c5561622078dbe887d87209c4ea2e3bf46.1456315043.git.asias@scylladb.com>
Gleb saw once:
scylla: gms/gossiper.cc:1393:
gms::gossiper::add_local_application_state(gms::application_state,
gms::versioned_value):: mutable: Assertion
`endpoint_state_map.count(ep_addr)' failed.
The assert is about we can not find the entry in endpoint_state_map of
the node itself. I can not really find any place we could call
add_local_application_state before we call gossiper::start_gossiping()
where it inserts broadcast address into endpoint_state_map.
I can not reproduce issue, let's log the error so we can narrow down
which application state triggered the assert.
Refs: #795
Message-Id: <f4433be0a0d4f23470a5e24e528afdb67b74c7ef.1456315043.git.asias@scylladb.com>
Fixes#934 - faulty assert in discard_sstables
run_with_compaction_disabled clears out a CF from compaction
mananger queue. discard_sstables wants to assert on this, but looks
at the wrong counters.
pending_compactions is an indicator on how much interested parties
want a CF compacted (again and again). It should not be considered
an indicator of compactions actually being done.
This modifies the usage slightly so that:
1.) The counter is always incremented, even if compaction is disallowed.
The counters value on end of run_with_compaction_disabled is then
instead used as an indicator as to whether a compaction should be
re-triggered. (If compactions finished, it will be zero)
2.) Document the use and purpose of the pending counter, and add
method to re-add CF to compaction for r_w_c_d above.
3.) discard_sstables now asserts on the right things.
Message-Id: <1456332824-23349-1-git-send-email-calle@scylladb.com>
We call a mutation source during the query path without any consideration
for attaching a priority. This is incorrect, and queries called through this
facility will end up in the default class.
Fix this by attaching the query priority class here.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Not all SSTable readers will end up getting the right tag for a priority
class. In particular, the range reader, also used for the memtables complete
ignores any priority class.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are situations when a memtable is already flushed but the memtable
reader will continue to be in place, relaying reads to the underlying
table.
For that reason, the "memtables don't need a priority class" argument
gets obviously broken. We need to pass a priority class for its reader
as well.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Fixes#937
In fixing #884, truncation not truncating memtables properly,
time stamping in truncate was made shard-local. This however
breaks the snapshot logic, since for all shards in a truncate,
the sstables should snapshot to the same location.
This patch adds a required function argument to truncate (and
by extension drop_column_family) that produces a time stamp in
a "join" fashion (i.e. same on all shards), and utilizes the
joinpoint type in caller to do so.
Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>
Lets operations working on all shards "join" and acquire
the same value of something, with that value being based on
whenever all shards reach the join.
Obvious use case: time stamp after one set of per-shard ops, but
before final ones.
The generation of the value is guaranteed to happen on the shards
that created the join point.
Based on the join-ops in CF::snapshot, but abstracted and made
caller responsibility. Primary use case is to help deal with
the join-problem of truncation.
Message-Id: <1456332856-23395-1-git-send-email-calle@scylladb.com>
The gocql driver assumes that there's a result metadata section in the
PREPARED message. Technically, Scylla is not at fault here as the CQL
specification explicitly states in Section 4.2.5.4. ("Prepared") that the
section may be empty:
- <result_metadata> is defined exactly as <metadata> but correspond to the
metadata for the resultSet that execute this query will yield. Note that
<result_metadata> may be empty (have the No_metadata flag and 0 columns, See
section 4.2.5.2) and will be for any query that is not a Select. There is
in fact never a guarantee that this will non-empty so client should protect
themselves accordingly. The presence of this information is an
However, Cassandra always populates the section so lets do that as well.
Fixes#912.
Message-Id: <1456317082-31688-1-git-send-email-penberg@scylladb.com>
* dist/ami/files/scylla-ami 398b1aa...d4a0e18 (3):
> Sort service running order (scylla-ami-setup.service -> scylla-io-setup.service -> scylla-server.service)
> Drop --ami and --disk-count parameters
> dist: pass the number of disks to set io params
In each gossip round, i.e., gossiper::run(), we do:
1) send syn message
2) peer node: receive syn message, send back ack message
3) process ack message in handle_ack_msg
apply_state_locally
mark_alive
send_gossip_echo
handle_major_state_change
on_restart
mark_alive
send_gossip_echo
mark_dead
on_dead
on_join
apply_new_states
do_on_change_notifications
on_change
4) send back ack2 message
5) peer node: process ack2 message
apply_state_locally
At the moment, syn is "wait" message, it times out in 3 seconds. In step
3, all the registered gossip callbacks are called which might take
significant amount of time to complete.
In order to reduce the gossip round latency, we make syn "no-wait" and
do not run the handle_ack_msg insdie the gossip::run(). As a result, we
will not get a ack message as the return value of a syn message any
more, so a GOSSIP_DIGEST_ACK message verb is introduced.
With this patch, the gossip message exchange is now async. It is useful
when some nodes are down in the cluster. We will not delay the gossip
round, which is supposed to run every second, 3*n seconds (n = 1-3,
since it talks to 1-3 peer nodes in each gossip round) or even
longer (considering the time to run gossip callbacks).
Later, we can make talking to the 1-3 peer nodes in parallel to reduce
latency even more.
Refs: #900
We will soon switch to use no-wait message for gossip. GOSSIP_DIGEST_SYN
will no longer return GOSSIP_DIGEST_ACK message. So we need a standalone
verb for GOSSIP_DIGEST_ACK.
The problem is we initialize _last_interpret when failure_detector
object is constructed. When interpret() runs for the first time, the
_last_interpret value is not the last time we run interpret() but the
time we initialize failure_detector object.
Fix by initializing _last_interpret inside interpret().
[Thu Feb 18 02:40:04 2016] INFO [shard 0] storage_service - Node 127.0.0.1 state jump to normal
[Thu Feb 18 02:40:04 2016] INFO [shard 0] storage_service - NORMAL: node is now in normal status
[Thu Feb 18 02:40:04 2016] INFO [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - No gossip backlog; proceeding
Starting listening for CQL clients on 127.0.0.1:9042...
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - Node 127.0.0.2 is now part of the cluster
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - InetAddress 127.0.0.2 is now UP
[Thu Feb 18 02:40:13 2016] INFO [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2
[Thu Feb 18 02:40:13 2016] WARN [shard 0] failure_detector - Not marking nodes down due to local pause of 9091 > 5000 (milliseconds)
"Add scylla-io-setup.service to configure max-io-requests and num-io-queues on first boot.
Moved SCYLLA_IO configuration code from scylla_sysconfig_setup to scylla-io-setup.service, revert commits related it.
On scylla-io-setup.service, autodetect Amazon EC2 instead of using AMI variable on sysconfig."
While serialization vector it is sometimes required to rollback some of
the serialized elements.
vector_position is the equivalent to the bytes_ostream position struct.
It holds information about the current position in a serialized vector,
the position in the bufffer and the current number of elements
serialized.
It will allow to rollback to the current point.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456041750-1505-2-git-send-email-amnon@scylladb.com>
Fixes#927.
The new visiting code builds cell instances using
atomic_cell::make_*() factory methods, which won't work in LSA context
because they depend on managed_bytes storage to be linearized. It may
not be since large blob support. This worked before because we created
cells from views before which works in all contexts.
Fix by constructing them in standard allocator context.
Message-Id: <1456234064-13608-2-git-send-email-tgrabiec@scylladb.com>
The line:
boost::apply_visitor(atomic_cell_or_collection_visitor(std::move(visitor), id, col), cell);
is executed in a loop, so visitor could be used after being
moved-from. This may not always be allowed for some visitors. Also,
vistors may keep state, which should be preserved for the whole
visitation.
This doesn't fix any issue right now.
Message-Id: <1456234064-13608-1-git-send-email-tgrabiec@scylladb.com>
The 'clear' function explicitly clears the screen and repaints it which
causes really annoying flicker. Use 'erase' to make scyllatop more
pleasant on the eyes.
Message-Id: <1456229348-2194-1-git-send-email-penberg@scylladb.com>
It is often the case that the there is useful debugging information
printed by the test before it hangs. It is annoying to see just "TIMED
OUT" in jenkins. Print the output always when it is available.
In addition to that, we should not interpret all exceptions thrown
from communicate() as timeouts. For example, currently ^C sent to the
script misleadingly results in "TIMED OUT" to be printed.
Message-Id: <1456174992-21909-1-git-send-email-tgrabiec@scylladb.com>
Every native scalar function is already tagged whether they're pure or
not but because we don't implement the is_pure() function, all functions
end up being advertised as pure. This means that functions like now()
that are *not* pure, end up being evaluated only once.
Fixes#571.
Message-Id: <1456227171-461-1-git-send-email-penberg@scylladb.com>
In same cases we may have a lot of empty partitions whose tombstones
expired, and there is no point in including them in the results.
This was found to cause performance issues for workloads using batch
updates. system.batchlog table would accumulate a lot of deletes over
time. It has gc_grace_seconds set to 0 so most of the tombstones would
be expired. mutation queries done by batchlog manager were still
returning all partitions present in memtables which caused mutation
queries result to be inflated. This in turn was causing
mutation_result_merger to take a long time to process them.
We don't support the 'CREATE TYPE' statement for now. The user-visible
error message, however, is unreadable because our CQL parser doesn't
even recognize the statement.
cqlsh:ks1> CREATE TYPE config (url text);
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message=" : cannot match to any predicted input...
Implement just enough of 'CREATE TYPE' parsing to be able to report a
human readable error message if someone tries to execute such
statements:
cqlsh:ks1> CREATE TYPE config (url text);
ServerError: <ErrorMessage code=0000 [Server error] message="User-defined types are not supported yet">
Message-Id: <1456148719-9473-2-git-send-email-penberg@scylladb.com>
As explained in commit 0ff0c55 ("transport: server: 'short' should be
unsigned"), "short" type is always unsigned in the CQL binary protocol.
Therefore, drop the read_unsigned_short() variant altogether and just
use read_short() everywhere.
Message-Id: <1456133171-1433-1-git-send-email-penberg@scylladb.com>
This may result in errors during reading like the following one:
runtime error: Unexpected marker. Found k, expected \x01\n)'
The error above happened when executing limits.py:max_key_length_test dtest.
After change the exception will happen during writing and will be clearer.
Refs #807.
This patch doesn't deal with the problem of ensuring that we will
never hit those errors, which is very desirable. We shouldn't ack a
write if we can't persist it to sstables.
Message-Id: <1456130045-2364-1-git-send-email-tgrabiec@scylladb.com>
Change the name used with class_registrator from "EverywhereReplicationStrategy"
(used in the initial patch from CASSANDRA-826 JIRA) to "EverywhereStrategy"
as it is in the current DCE code.
With this change one will be able to create an instance of
everywhere_replication_strategy class by giving either
an "org.apache.cassandra.locator.EverywhereStrategy" (full name) or
an "EverywhereStrategy" (short name) as a replication strategy name.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456081258-937-1-git-send-email-vladz@cloudius-systems.com>
This strategy would ignore an RF configuration and would
always try to replicate on all cluster nodes.
This means that its get_replication_factor() would return a
number of currently "known" nodes in the cluster and
if a cluster is currently bootstrapping this value obviously may
change in time for the same key. Therefore using this strategy
should be done with caution.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456074333-15014-3-git-send-email-vladz@cloudius-systems.com>
Return a number of currently known endpoints when
it's needed in a fast path flow.
Calling a get_all_endpoints().size() for that matter
would not be fast enough because of the unordered_set->vector transformation
we don't need.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456074333-15014-2-git-send-email-vladz@cloudius-systems.com>
Current algorithm is O(N^2) where N is the column count. This causes
limits.py:TestLimits.max_columns_and_query_parameters_test to timeout
because CREATE TABLE statement takes too long.
This change replaces it with an algorithm of O(N)
complexity. _defined_names are already sorted so if any duplicates
exist, they must be next to each other.
Message-Id: <1456058447-5080-1-git-send-email-tgrabiec@scylladb.com>
When find_uuid() fails Scylla would terminate with:
Exiting on unhandled exception of type 'std::out_of_range': _Map_base::at
But we are supposed to ignore directories for unknown column
families. The try {} catch block is doing just that when
no_such_column_family is thrown from the find_column_family() call
which follows find_uuid(). Fix by converting std::out_of_range to
no_such_column_family.
Message-Id: <1456056280-3933-1-git-send-email-tgrabiec@scylladb.com>
When there is a lot of chunks we may get stack overflow.
This seems to fix issue #906, a memory corruption during schema
merge. I suspect that what causes corruption there is overflowing of
the stack allocated for the seastar thread. Those stacks don't have
red zones which would catch overflow.
Message-Id: <1456056288-3983-1-git-send-email-tgrabiec@scylladb.com>
before:
^CINFO [shard 0] compaction_manager - Asked to stop
INFO [shard 0] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 0] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 1] compaction_manager - Asked to stop
INFO [shard 2] compaction_manager - Asked to stop
INFO [shard 1] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 2] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 3] compaction_manager - Asked to stop
INFO [shard 1] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 2] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 3] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 3] compaction_manager - compaction task handler stopped due to shutdown
after:
^CINFO [shard 0] compaction_manager - Asked to stop
INFO [shard 0] compaction_manager - Stopped
INFO [shard 1] compaction_manager - Asked to stop
INFO [shard 2] compaction_manager - Asked to stop
INFO [shard 3] compaction_manager - Asked to stop
INFO [shard 1] compaction_manager - Stopped
INFO [shard 2] compaction_manager - Stopped
INFO [shard 3] compaction_manager - Stopped
`compaction_manager - compaction task handler stopped due to shutdown` is still printed
in debug level
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <535d5ad40102571a3d5d36257342827989e8f0f4.1455835407.git.raphaelsc@scylladb.com>
"This series switches mutation_partition_serializer, mutation_partition_view
and frozen_mutation to the IDL-based serialization format.
canonical_mutations and frozen_schemas are still not converted.
Quick test with 4 node ccm cluster and cassandra-stress doesn't show any
problem, unsurprisingly, as frozen_mutation_test obviously still passes."
Allows mixing data_output with other output stream like
seastar::simple_output_stream which is useful when switching to the new
IDL-based serializers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
bytes can always be trivially converted to bytes_view. Conversion in the
other direction requires a copy.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch allows using both auto-generated serializers or writer-based
serialization for non-stub [[writable]] types.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This helps having C++ code properly indented both in the compiler source
code and in the auto-generated files.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Test auto-generated and writer-based serialization as well as
deserialization of simple compound type, vectors and variants.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
To implement nodetool's "--start-token"/"--end-token" feature, we need
to be able to repair only *part* of the ranges held by this node.
Our REST API already had a "ranges" option where the tool can list the
specific ranges to repair, but using this interface in the JMX
implementation is inconvenient, because it requires the *Java* code
to be able to intersect the given start/end token range with the actual
ranges held by the repaired node.
A more reasonable approach, which this patch uses, is to add new
"startToken"/"endToken" options to the repair's REST API. What these
options do is is to find the node's token ranges as usual, and only
then *intersect* them with the user-specified token range. The JMX
implementation becomes much simpler (in a separate patch for scylla-jmx)
and the real work is done in the C++ code, where it belongs, not in
Java code.
With the additional scylla-jmx patch to use the new REST API options
provided here, this fixes#917.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455807739-25581-1-git-send-email-nyh@scylladb.com>
'nodetool cleanup' must wait for termination of cleanup, however,
cleanup is handled asynchronously. To solve that, a mechanism is
added here to wait for termination of a cleanup. This mechanism is
about using promise to notificate waiter of cleanup completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <6dc0a39170f3f51487fb8858eb443573548d8bce.1455655016.git.raphaelsc@scylladb.com>
"writers are used to stream objects programmatically and not from objects.
visitors views are used to retrieve information from serialized objects without
deserialize them entirely, but to skip to the position in the buffer with the
relevant information and deserialize only it."
This patch adds static assert to the generated code that verify that a
declare type in the idl matches the parameter type.
Accepted type ignores reference and const when comparison.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a writer object to classes in the idl.
It adds attribute support to classes. Writer will be created for classes
that are marked as writable.
For the writers, the code generator creates two kind of struct.
states, that holds the write state (manly the place holders for all
current objects, and vectors) and nodes that represent the current
position in the writing state machine.
To write an object create a writer:
For example creating a writer for mutation, if out is a bytes_ostream
writer_of_mutation w(out);
Views are used to read from buffer without deserialize an entire
object.
This patch adds view creation to the idl-compiler. For each view a
read_size function is created that will be used when skipping through
buffers.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The skip template function is used when skipping data types.
By default is uses a deserializer to calculate the size.
A specific implementation save unneeded deserialization. For fix sized
object the skip function would become an expression allowing the
compiler to drop the function altogether.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The serialization_visitors.hh contains helper classes for the readers and writers
visitors class.
place_holder is a wrapper around bytes_stream place holder.
frame is used to store size in bytes.
empty_frame is used with final object (that do not store their size)
from the code that uses it, it looks the same, but in practice it does
not store any data.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Reader and writer can use the bytes_ostream as a raw bytes stream,
handling the bytes encoding and streaming on their own.
To fully support this functionality, place holder should support it as
well.
This patch adds a get_stream method that return a simple_output_stream
writer can use it using their own serialization function.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a stub support for boost::variant. Currently variant are
not serialized, this is added just so non stub classes will be able to
compile.
deserialize for chrono::time_point and deserializer for chrono::duration
unknown variant:
Planning for situations where variant could be expanded, there may be
situation that a variant will return an unknown value.
In those cases the data and index will be paseed to the reader, that
can decide what to do with it.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
* dist/ami/files/scylla-ami b3b85be...398b1aa (3):
> Import AMI initialization code from scylla-server repo
> Use long options on scylla_raid_setup and scylla_sysconfig_setup
> Wait more longer to finishing AMI setup
* seastar b25a958...1bbb02f (6):
> native-stack: fix arp request missing under loopback connection
> apps: iotune: fix compilation with g++ 4.9
> simple-stream: Add copy constructor
> tcp: don't need to choose another core since only one core
> Merge "Fix undefined behaviors related to reactor shutdown" from Tomasz
> rpc: do not wait for data to be send before reporting timeout
From Avi:
This patchset introduces a linearization context for managed_bytes objects.
Within this context, any scattered managed_bytes (found only in lsa regions,
so limited to memtable and cache) are auto-linearized for the lifetime of
the context. This ensures that key and value lookups can use fast
contiguous iterators instead of using slow discontiguous iterators (or
crashing, as is the case now).
To avoid scattered keys (and values, though those are already protected)
from being accessed, run the update procedure in a managed_bytes linearization
context.
Fixes#807.
Avi says:
"Something like unordered_set<unsigned long> is error prone, because ints
tend to mix up (also, need to use a sized type, unsigned long varies among
machines)."
With that in mind, it's better if we keep track of compacting sstables in
a unordered_set<shared_sstable>.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <249f0fd4cfcf786cf3c37a79978f7743d07f48ad.1455120811.git.raphaelsc@scylladb.com>
* seastar 353b1a1...0f759f0 (11):
> tutorial: add a link to future API documentation
> sleep: document
> tutorial: fix typos
> gate: add check() method
> tutorial: introduce seastar::gate
> doc: explain how to test the native stack without dpdk
> doc: separate the mini-tutorial into its own file
> doc: move DPDK build instructions to its own file
> doc: split building instructions into separate files
> doc: fix/modernize git commands in contributing.md
> doc: how-to on contributing & guidelines
We want the format of query results to be eventually defined in the
IDL and be independent of the format we use in memory to represent
collections. This change is a step in this direction.
The change decouples format of collection cells in query results from
our in-memory representation. We currently use collection_mutation_view,
after the change we will use CQL binary protocol format. We use that because
it requires less transformations on the coordinator side.
One complication is that some list operations need to retrieve keys
used in list cells, not only values. To satisfy this need, new query
option was added called "collections_as_maps" which will cause lists
and sets to be reinterpreted as maps matching their underlying
representation. This allows the coordinator to generate mutations
referencing existing items in lists.
List operations and prefetching were not handling static columns
correctly. One issue was that prefetching was attaching static column
data to row data using ids which might overlap with clustered columns.
Another problem was that list operations were always constructing
clustering key even if they worked on a static column. For static
columns the key would be always empty and lookup would fail.
The effect was that list operations which depend on curent state had
no effect. Similar problem could be observed on C* 2.1.9, but not on 2.2.3.
Fixes#903.
This puts knowledge about which cql_serialization_formats have the
same collection format into one place,
cql_serialization_format::collection_format_unchanged().
The validation was wrongly assuming that empty thrift key, for which
the original C* code guards against, can only correspond to empty
representation of our partition_key. This no longer holds after:
commit 095efd01d6
"keys: Make from_exploded() and components() work without schema"
This was responsible for dtest failure:
cql_additional_tests.TestCQL:column_name_validation_test
serialize() and from_bytes() is a low level interface, which in this
case can be replaced with a partition_key static factory method
resulting in cleaner code.
This reverts commit dadd097f9c.
That commit caused serialized forms of varint and decimal to have some
excess leading zeros. They didn't affect deserialization in any way but
caused computed tokens to differ from the Cassandra ones.
Fixes#898.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1455537278-20106-1-git-send-email-pdziepak@scylladb.com>
"Fixes #884Fixes#895
Also at seastar-dev: calle/truncate_more
1.) Change truncation records to be stored with IDL serialization
2.) Fix db::serializers encoding of replay_position
3.) Detect attempted reading of Origin truncation records, and instead
of crashing, ignore and warn.
4.) Change truncation time stamps to be generated per-shard, _after_
CF flush is done, otherwise data in memtables at flush would be
retained/replayed on next start. Retain the highest time stamp
generated.
Note for (3): This patch set does _not_ clear out origin records
automatically. This because I feel that is a somewhat drastic and
irreversible thing to do. If we want to avail the user of a means
to get rid of the (3) warning, we should probably tell him to either
use cqlsh, or add an API call for this, so he can do it explicitly.
"
When shutting down a node gracefully, this patch asks all ongoing repairs
started on this node to stop as soon as possible (without completing
their work), and then waits for these repairs to finish (with failure,
usually, because they didn't complete).
We need to do this, because if the repair loop continues to run while we
start destructing the various services it relies on, it can crash (as
reported in #699, although the specific crash reported there no longer
occurs after some changes in the streaming code). Additionally, it is
important that to stop the ongoing repair, and not wait for it to complete
its normal operation, because that can take a very long time, and shutdown
is supposed to not take more than a few seconds.
Fixes#699.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455218873-6201-1-git-send-email-nyh@scylladb.com>
We can't move-from in the loop because the subject will be empty in
all but the first iteration.
Fixes crash during node stratup:
"Exiting on unhandled exception of type 'runtime_exception': runtime error: Invalid token. Should have size 8, has size 0"
Fixes update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test (and probably others)
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
* seastar 14c9991...353b1a1 (2):
> scripts: posix_net_conf.sh: Change the way we learn NIC's IRQ numbers
> gate: protect against calling close() more than once
When scylla stopped an ongoing compaction, the event was reported
as an error. This patch introduces a specialized exception for
compaction stop so that the event can be handled appropriately.
Before:
ERROR [shard 0] compaction_manager - compaction failed: read exception:
std::runtime_error (Compaction for keyspace1/standard1 was deliberately
stopped.)
After:
INFO [shard 0] compaction_manager - compaction info: Compaction for
keyspace1/standard1 was stopped due to shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1f85d4e5c24d23a1b4e7e0370a2cffc97cbc6d44.1455034236.git.raphaelsc@scylladb.com>
"This series changes the on-wire definitions of keys to be of the following form:
class partition_key {
std::vector<bytes> exploded();
};
Keys are therefore collections of components. The components are serialized according
to the format specified in the CQL binary protocol. No bit depends now on how we store keys in memory.
Constructing keys from components currently requires a schema reference,
which makes it not possible to deserialize or serialize the keys automatically
by RPC. To avoid those complications, compound_type was changed so that
it can be constructed and components can be iterated over without schema.
Because of this, partition_key size increased by 2 bytes."
For simplicity, we want to have keys serializable and deserializable
without schema for now. We will serialize keys in a generic form of a
vector of components where the format of components is specified by
CQL binary protocol. So conversion between keys and vector of
components needs to be possible to do without schema.
We may want to make keys schema-dependent back in the future to apply
space optimizations specific to column types. Existing code should
still pass schema& to construct and access the key when possible.
One optimization had to be reverted in this change - avoidance of
storing key length (2 bytes) for single-component partition keys. One
consequence of this, in addition to a bit larger keys, is that we can
no longer avoid copy when constructing single-component partition keys
from a ready "bytes" object.
I haven't noticed any significant performance difference in:
tests/perf/perf_simple_query -c1 --write
It does ~130K tps on my machine.
Like we did in commit d54c77d5d0,
make the remaining functions in abstract_replication_strategy return
non-wrap-around ranges.
This fixes:
ERROR [shard 0] stream_session - [Stream #f0b7fda0-cf3e-11e5-b6c4-000000000000]
stream_transfer_task: Fail to send to 127.0.0.4:0: std::runtime_error (Not implemented: WRAP_AROUND)
in streaming.
Message-Id: <514d2a9a1d3b868d213464c8858ac5162c0338d8.1455093643.git.asias@scylladb.com>
The conversion to bytes_view can fail if the key is scattered; so defer that
conversion until later. In a later patch we will intervene before the
conversion to ensure the data is linearized.
Using a partition_key_view can save an allocation in some cases. We will
make use of it when we linearize a partition_key; during the process we
are given a simple byte pointer, and constructing a partition_key from that
requires an allocation.
A large managed_bytes blob can be scattered in lsa memory. Usually this is
fine, but someone we want to examine it in place without copying it out, but
using contiguous iterators for efficiency.
For this use case, introduce with_linearized_managed_bytes(Func),
which runs a function in a "linearization context". Within the linearization
context, reads of managed_bytes object will see temporarily linearized copies
instead of scattered data.
Fixes#884
Time stamps for truncation must be generated after flush, either by
splitting the truncate into two (or more) for-each-shard operations,
or simply by doing time stamping per shard (this solution).
We generate TS on each shard after flushing, and then rely on the
actual stored value to be the highest time point generated.
This should however, from batch replay point of view, be functionally
equivalent. And not a problem.
Since the table is written from all shards, and we possibly might
have conflicting time stamps, we define the trucated_at time
as the highest time point. I.e. conservative.
Truncation records are not portable between us and Origin.
We need to detect and ensure we neither try to use, and more to the
point, don't crash because of data format error when loading, origin
records from a migrated system.
This problem was seen by Tzach when doing a migration from an origin
setup.
Updated record storage to use IDL-serialized types + added versioning
and magic marking + odd-size-checking to ensure we load only correct
data. The code will also deal with records from an older version of
scylla.
gcc 4.9 complains about the type{ val, val } construction of
type with implicit default constructor, i.e. member = initial
declarations. gcc 5 does not (and possibly rightly so).
However, we still (implicitly) claim to support gcc 4.9 so
why not just change this particular instance.
Message-Id: <1454921328-1106-1-git-send-email-calle@scylladb.com>
connection::_pending_requests_gate is responsible for keeping connection
objects alive as long as there are outstanding requests and is closed
in connection::proccess() when needed. Closing it in connection::shutdown()
as well may cause the gate to be closed twice what is a bug.
Fixes#690.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1454596390-23239-1-git-send-email-pdziepak@scylladb.com>
While pkgconfig is supposed to be a distribution and version neutral way
of detecting packages, it doesn't always work this way. The sd_notify()
manual page documents that sd_notify is available via the libsystemd
package, but on centos 7.0 it is only available via the libsystemd-daemon
package (on centos 7.1+ it works as expected).
Fix by allowing for alternate version of package names, testing each one
until a match is found.
Fixes#879.
Message-Id: <1454858862-5239-1-git-send-email-avi@scylladb.com>
Currently, only the shard where the stream_plan is created on will send
streaing mutations. To utilize all the available cores, we can make each
shard send mutations which it is responsbile for. On the receiver side,
we do not forward the mutations to the shard where the stream_session is
created, so that we can avoid unnecessary forwarding.
Note: the downside is that it is now harder to:
1) to track number of bytes sent and received
2) to update the keep alive timer upon receive of the STREAM_MUTATION
To fix, we now store the sent/recieved bytes info on all shards. When
the keep alive timer expires, we check if any progress has been made.
Hopefully, this patch will make the streaming much faster and in turn
make the repair/decommission/adding a node faster.
Refs: https://github.com/scylladb/scylla/issues/849
Tested with decommission/repair dtest.
Message-Id: <96b419ab11b736a297edd54a0b455ffdc2511ac5.1454645370.git.asias@scylladb.com>
The is_reversed function uses a variable length array, which isn't
spec-abiding C++. Additionally, the Clang compiler doesn't allow them
with non-POD types, so this function wouldn't compile.
After reading through the function it seems that the array wasn't
necessary as the check could be calculated inline rather than
separately. This version should be more performant (since it no longer
requires the VLA lookup performance hit) while taking up less memory in
all but the smallest of edge-cases (when the clustering_key_size *
sizeof(optional<bool>) < sizeof(size_type) - sizeof(uint32_t) +
sizeof(bool).
This patch uses relation_order_unsupported it assure that the exception
order is consistent with the preivous version. The throw would
otherwise be moved into the initial for-loop.
There are two derrivations in behavior:
The first is the initial assert. It however should not change the apparent
behavior besides causing orderings() to be looked up 2x in debug
situations.
The second is the conversion of is_reversed_ from an optional to a bool.
The result is that the final return value is now well-defined to be
false in the release-condition where orderings().size() == 0, rather
than be the ill-defined *is_reversed_ that was there previously.
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454546285-16076-4-git-send-email-erich.keane@verizon.net>
Clang enforces that a union's constexpr CTOR must initialize
one of the members. The spec is seemingly silent as to what
the rule on this is, however, making this non-constexpr results in clang
accepting the constructor.
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454604300-1673-1-git-send-email-erich.keane@verizon.net>
PHI_FACTOR is a constexpr variable that is defined using std::log.
Though G++ has a constexpr version of std::log, this itself is not spec
complaint (in fact, Clang enforces this). See C++ Spec 26.8 for the
definition of std::log and 17.6.5.6 for the rule regarding adding
constexpr where it isn't specified.
This patch replaces the std::log statement with a version from math.h
that contains the exact value (M_LOG10El).
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454603285-32677-1-git-send-email-erich.keane@verizon.net>
Array of integral types on little endian machine can be memcpyed into/out
of a buffer instead of serialized/deserialized element by element.
Message-Id: <20160204155425.GC6705@scylladb.com>
It is much easier to see what is going on this way otherwise graphs for
bg mutations and overall mutations are very close with usual scaling for
many workloads.
Message-Id: <20160204083452.GH6705@scylladb.com>
Refs #860
Refs #802
An sstable file set with any component missing is interpreted as a
critical error during boot. Currently sstable removal procedure could
leave the files in a non-bootable state if the process crashed after
TOC was removed but before all components were removed as well.
To solve this problem, start the removal by renaming the TOC file to a
so called "temporary TOC". Upon boot such kind of TOC file is
interpreted as an sstable which is safe to remove. This kind of TOC
was added before to deal with a similar scenario but in the opposite
direction - when writing a new sstable.
Fixes#868.
Registerring exit hooks while reactor is already iterating over exit
hooks is not allowed and currently leads to undefined behavior
observed in #868. While we should make the failure more user friendly,
registering exit hooks concurrently with shutdown will not be allowed.
We don't expect exit hooks to be registered after exit starts because
this would violate the guarantee which says that exit hooks are
executed in reverse order of registration. Starting exit sequence in
the middle of initialization sequence would result in use after free
errors. Btw, I'm not sure if currently there's anything which prevents
this
To solve this problem, move the exit hook to initilization
sequence. In case of tests, the cleanup has to be called explicitly.
Fixes#563.
Refs #584
CQLv2 encodes batch query_options in v1 format, not v2+.
CQLv1 otoh has no batch support at all.
Make read_options use explicit version format if needed.
v2: Ensure we preserve cql protocol version in query_opts
Message-Id: <1454514510-21706-1-git-send-email-calle@scylladb.com>
Currently, we wait for ongoing compaction during shutdown, but
that may take 'forever' if compacting huge sstables with a slow
disk. Compaction of huge sstables will take a considerable amount
of time even with fast disks. Therefore, all ongoing compaction
should be stopped during shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3370f17ce4274df417ea60651f33fc5d4de91199.1454441286.git.raphaelsc@scylladb.com>
Task is stopped by closing gate and forcing it to exit via gate
exception. The problem is that task->compacting_cf may be set to
the column family being compacted, and compaction_manager::remove
would see it and try to stop the same task again, which would
lead to problems. The fix is to clean task->compacting_cf when
stopping task.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3473e93c1a107a619322769d65fa020529b5501b.1454441286.git.raphaelsc@scylladb.com>
The problem is that on the follower side, we set up _session_info too
late, after received PREPARE_DONE_MESSAGE message. The initiator can
send STREAM_MUTATION before sending PREPARE_DONE_MESSAGE message.
To fix, we set up _session_info after we received the prepare_message on
both initiator and follower.
Fixes#869
scylla: streaming/session_info.cc:44: void
streaming::session_info::update_progress(streaming::progress_info):
Assertion `peer == new_progress.peer' failed.
Message-Id: <6d945ba1e8c4fc0949c3f0a72800c9448ba27761.1454476876.git.asias@scylladb.com>
Change the partition_checksum structure to be better suited for the
new serializers:
1. Use std::array<> instead of a C array, as the latters are not
supported by the new serializers.
2. Use an array of 32 bytes, instead of 4 8-byte integers. This will
guarantee that no byte-swapping monkey-business will be done on
these checksums.
The checksum XOR and equality-checking methods still temporarily
cast the bytes to 8-byte chunks, for (hopefully) better performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1454364900-3076-1-git-send-email-nyh@scylladb.com>
Seems like boost::intrusive::set<>::comp() is not accessible on some
versions of boost. Replace by the equivalent
boost::intrusive::set<>::key_comp().
Fixes#858.
Message-Id: <1454326483-29780-1-git-send-email-avi@scylladb.com>
"Make the streaming use standalone tcp connection and send more mutations in
parallel.
It is supposed to help: "Decommission not fully utilizing hardware #849""
The idea behind the current 10 stream_mutations per core limitation is
to avoid streaming overwhelms the TCP connection and starves normal cql
verbs if the streaming mutations are big and takes long time to
complete.
Now that we use a standalone connection for streaming verbs, we can
increase the limitation.
Hopefully, this will fix#849.
In streaming, the amount of data needs to be streamed to peer nodes
might be large.
In order to avoid the streaming overwhelms the TCP connection used by
user CQL verbs and starves the user CQL queries, we use a standalone TCP
connection for streaming verbs.
* dist/ami/files/scylla-ami e284bcd...b2724be (2):
> Revert "Run scylla.yaml construction only once"
> Move AMI dependent part of scylla_prepare to scylla-ami-setup.service
"This moves AMI dependent part of scylla_prepare to scylla-ami repo, make it scylla-ami-setup.service which is independent systemd unit.
Also, it stopped calling scylla_sysconfig_setup on scylla_setup (called on AMI creation time), call it from scylla-ami-setup instead."
Install scylla-ami-setup.service, stop calling scylla_sysconfig_setup on AMI.
scylla-ami-setup.service will call it instead.
Only works with scylla-ami fix.
Fixes#857
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
It actually doesn't called unconditionally now, (since $LOCAL_PKG is always empty) so we can safely remove this.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
To simplify streaming verb handler.
- Use get_session instead of open coded logic to get get_coordinator and
stream_session in all the verb handlers
- Use throw instead of assert for error handling
- init_receiving_side now returns a shared_ptr<stream_result_future>
It is 1:1 mapping between session_info and stream_session. Putting
session_info inside stream_session, we can get rid of the
stream_coordinator::host_streaming_data class.
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.
# Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
# Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
"summary":"Start reporting on one or more collectd metric",
"type":"void",
"nickname":"enable_collectd",
"produces":[
"application/json"
],
"parameters":[
{
"name":"pluginid",
"description":"The plugin ID, describe the component the metric belongs to. Examples are cache, thrift, etc'. Regex are supported.The plugin ID, describe the component the metric belong to. Examples are: cache, thrift etc'. regex are supported",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"instance",
"description":"The plugin instance typically #CPU indicating per CPU metric. Regex are supported. Omit for all",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"type",
"description":"The plugin type, the type of the information. Examples are total_operations, bytes, total_operations, etc'. Regex are supported. Omit for all",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"type_instance",
"description":"The plugin type instance, the specific metric. Exampls are total_writes, total_size, zones, etc'. Regex are supported, Omit for all",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"enable",
"description":"set to true to enable all, anything else or omit to disable",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
@@ -63,10 +114,10 @@
"operations":[
{
"method":"GET",
"summary":"Get a collectd value",
"summary":"Get a list of all collectd metrics and their status",
"type":"array",
"items":{
"type":"type_instance_id"
"type":"collectd_metric_status"
},
"nickname":"get_collectd_items",
"produces":[
@@ -74,6 +125,25 @@
],
"parameters":[
]
},
{
"method":"POST",
"summary":"Enable or disable all collectd metrics",
"type":"void",
"nickname":"enable_all_collectd",
"produces":[
"application/json"
],
"parameters":[
{
"name":"enable",
"description":"set to true to enable all, anything else or omit to disable",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
}
@@ -113,6 +183,20 @@
}
}
}
},
"collectd_metric_status":{
"id":"collectd_metric_status",
"description":"Holds a collectd id and an enable flag",
"summary":"Fetch a string representation of the Scylla version.",
"type":"string",
"nickname":"get_scylla_release_version",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_service/schema_version",
"operations":[
@@ -836,6 +852,22 @@
"type":"string",
"paramType":"query"
},
{
"name":"startToken",
"description":"Token on which to begin repair",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"endToken",
"description":"Token on which to end repair",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"columnFamilies",
"description":"Which column families to repair in the given keyspace. Multiple columns families can be named separated by commas. If this option is missing, all column families in the keyspace are repaired.",
@@ -1169,11 +1201,12 @@
],
"parameters":[
{
"name":"non_system",
"description":"When set to true limit to non system",
"name":"type",
"description":"Which keyspaces to return",
"required":false,
"allowMultiple":false,
"type":"boolean",
"type":"string",
"enum":["all","user","non_local_strategy"],
"paramType":"query"
}
]
@@ -1704,6 +1737,57 @@
}
]
},
{
"path":"/storage_service/slow_query",
"operations":[
{
"method":"POST",
"summary":"Set slow query parameter",
"type":"void",
"nickname":"set_slow_query",
"produces":[
"application/json"
],
"parameters":[
{
"name":"enable",
"description":"set it to true to enable, anything else to disable",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"ttl",
"description":"TTL in seconds",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"threshold",
"description":"Slow query record threshold in microseconds",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
},
{
"method":"GET",
"summary":"Returns the slow query record configuration.",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.