"
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen while a table's schema is being
altered, when we receive the mutation from a node which hasn't
processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes #5095.
"
* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
mvcc: Fix incorrect schema version being used to copy the mutation when applying
mutation_partition: Track and validate schema version in debug builds
tests: Use the correct schema to access mutation_partition
(cherry picked from commit 83bc59a89f)
Propagate the timeout to `consume_mutation_fragments_until()` and hence
to the underlying reader, to ensure queued sstable reads that belong
to timed-out requests are dropped from the queue, instead of
pointlessly serving them.
consume_mutation_fragments_until() gains a `timeout` parameter, as it
previously didn't have one.
Fixes: #1068
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190906135629.67342-1-bdenes@scylladb.com>
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated, the row will end up in an invalid state. This
patch adds an assertion to make sure we fail early in such cases.
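A minimal sketch of such a guard; the member names (_cells, id()) are assumptions for illustration, not the actual code:
    void row::append_cell(column_id id, atomic_cell_or_collection cell) {
        // Precondition: cells are appended in strictly increasing column id
        // order. Fail early instead of silently producing an invalid row.
        assert(_cells.empty() || _cells.back().id() < id);
        _cells.emplace_back(id, std::move(cell));
    }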
(cherry picked from commit 060e3f8ac2)
Fixes a segfault when querying for an empty keyspace.
Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.
Fixes #4689
Queries to the system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer.
This happened because multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for size_estimates_mutation_reader.
This commit fixes the issue and closes #4689.
Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.
Refactor tests to use seastar::thread.
It shouldn't rely on argument evaluation order, which is UB.
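For illustration only (a hypothetical call, not the code from this patch): the two arguments below may be evaluated in either order, so the second may observe a moved-from `s`:
    // Broken: unspecified whether std::move(s) or s->version() runs first.
    consume(make_reader(std::move(s)), s->version());
    // Fixed: sequence the evaluation explicitly.
    auto version = s->version();
    consume(make_reader(std::move(s)), version);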
Fixes #4718.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry-picked from commit 0e732ed1cf)
Currently, if there is a fragment in _ready and _out_of_range was set
after the row end was consumed, push_ready_fragments() would return
without emitting partition_end.
This is problematic once we make consume_row_start() emit
partition_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.
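A sketch of the intended behavior; all member names here are assumptions:
    void push_ready_fragments() {
        // Drain everything that is ready...
        while (!_ready.empty()) {
            push_fragment(std::move(_ready.front()));
            _ready.pop_front();
        }
        // ...and, if we already ran out of range, also emit partition_end so
        // no fragment of the previous partition is left pending.
        if (_out_of_range) {
            push_fragment(mutation_fragment(partition_end()));
            _out_of_range = false;
        }
    }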
Fixes #4786
(cherry picked from commit 9b8ac5ecbc)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.
Fixes #4589
(cherry picked from commit efa7951ea5)
There is no guarantee that rpc streaming makes progress in some time
period. Remove the keep alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.
The keep alive timer is used to close the session in the following case:
n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2
We need this because we do not kill the session when gossip thinks a
node is down: the node being down might only be temporary, and it would
be a waste to drop the work already done, especially when the stream
session takes a long time.
However, range_streamer does not stream all data in a single stream
session; it streams 10% of the data at a time, and we have retry logic.
So it is fine to kill a stream session when gossip thinks a node is
down. This patch changes the code to close all stream sessions with a
node that gossip considers down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>
(cherry picked from commit b8158dd65d)
Message-Id: <4ebc544c85261873591fd5ac30043e693d74434a.1555466551.git.asias@scylladb.com>
When --abort-on-lsa-bad-alloc is enabled we want to abort whenever
we think we can be out of memory.
We covered failures due to bad_alloc thrown from inside of the
allocation section, but did not cover failures from reservations done
at the beginning of with_reserve(). Fix by moving the trap into
reserve().
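A sketch of the shape of the change; the names (segment_pool, _abort_on_bad_alloc) are assumptions, the point being only that the trap now fires inside reserve() as well:
    void segment_pool::reserve(size_t segments) {
        while (_free_segments < segments) {
            if (!try_allocate_more_segments()) {
                if (_abort_on_bad_alloc) {
                    abort(); // now also covers with_reserve() reservations
                }
                throw std::bad_alloc();
            }
        }
    }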
Message-Id: <1553258915-27929-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3356a085d2)
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed when retried after memory
reclamation.
We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.
Fixes #2924
Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dafe22dd83)
Don't hold reference to sstables cleaned up, so that file descriptors
for their index and data files will be closed and consequently disk
space released.
Fixes #3735.
Backport note:
To reduce risk considerably, we'll not backport the mechanism to
release sstables introduced in the incremental compaction work.
Instead, only one sstable is passed to table::cleanup_sstables() at a
time (it won't affect performance because the operation is serialized
anyway), to make it easy to release the reference to the cleaned
sstable held by the compaction manager.
tests: release mode.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180914194047.26288-1-raphaelsc@scylladb.com>
(cherry picked from commit 5bc028f78b)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190416025801.15048-1-raphaelsc@scylladb.com>
All schema changes made to the node locally are serialized on a
semaphore which lives on shard 0. For historical reasons, they don't
queue but rather try to take the lock without blocking and retry on
failure with a random delay from the range [0, 100 us]. Contenders
which do not originate on shard 0 will have an extra disadvantage as
each lock attempt will be longer by the across-shard round trip
latency. If there is constant contention on shard 0, contenders
originating from other shards may keep losing the race to take the lock.
Schema merge executed on behalf of a DDL statement may originate on
any shard. Same for the schema merge which is coming from a push
notification. Schema merge executed as part of the background schema
pull will originate on shard 0 only, where the application state
change listeners run. So if there are constant schema pulls, DDL
statements may take a long time to get through.
The fix is to serialize merge requests fairly, by using the blocking
semaphore::wait(), which is fair.
We don't have to back-off any more, since submit_to() no longer has a
global concurrency limit.
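In outline, using Seastar primitives (the lock accessor and merge function names are approximations):
    // Before: semaphore::try_wait() plus a retry with a random delay, which
    // starves contenders from other shards. After: queue fairly on shard 0.
    return smp::submit_to(0, [] {
        return with_semaphore(schema_merge_lock(), 1, [] {
            return do_merge_schema();
        });
    });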
Fixes #4436.
Message-Id: <1555349915-27703-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3fd82021b1)
The problem happens after a schema change because we fail to properly
remove an ongoing compaction, which stopped being tracked, from the
list that is used to calculate backlog. As a result, a compaction read
monitor (which ceases to exist after compaction ends) may be used after
it is freed.
Fixes #4410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190409024936.23775-1-raphaelsc@scylladb.com>
(cherry-picked from commit 8a117c338a)
Varint and decimal type serialization did not update the output
iterator after generating a value, which may lead to corrupted
sstables: variable-length integers were properly serialized,
but if anything followed them directly in the buffer (e.g. in a tuple),
their value would be overwritten.
Fixes #4348
Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
json_test.FromJsonInsertTests.complex_data_types_test
json_test.ToJsonSelectTests.complex_data_types_test
Note that dtests still do not succeed 100% due to formatting differences
in compared results (e.g. 1.0e+07 vs 1.0E7), but it's no longer a query
correctness issue.
(cherry picked from commit 287a02dc05)
When we're populating a partition range and the population range ends
with a partition key (not a token) which is present in sstables and
there was a concurrent memtable flush, we would abort on the following
assert in cache::autoupdating_underlying_reader:
utils::phased_barrier::phase_type creation_phase() const {
    assert(_reader);
    return _reader_creation_phase;
}
That's because autoupdating_underlying_reader::move_to_next_partition()
clears the _reader field when it tries to recreate a reader but it finds
the new range to be empty:
if (!_reader || _reader_creation_phase != phase) {
    if (_last_key) {
        auto cmp = dht::ring_position_comparator(*_cache._schema);
        auto&& new_range = _range.split_after(*_last_key, cmp);
        if (!new_range) {
            _reader = {};
            return make_ready_future<mutation_fragment_opt>();
        }
Fix by not asserting on _reader. creation_phase() will now be
meaningful even after we clear the _reader. The meaning of
creation_phase() is now "the phase in which the reader was last
created or 0", which makes it valid in more cases than before.
If the reader was never created we will return 0, which is smaller
than any phase returned by cache::phase_of(), since cache starts from
phase 1. This shouldn't affect current behavior, since we'd abort() if
called for this case, it just makes the value more appropriate for the
new semantics.
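With the fix, the accessor from the excerpt above reduces to:
    utils::phased_barrier::phase_type creation_phase() const {
        // The phase in which the reader was last created, or 0 if it never was.
        return _reader_creation_phase;
    }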
Tests:
- unit.row_cache_test (debug)
Fixes #4236
Message-Id: <1553107389-16214-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 69775c5721)
Introduced in 2a437ab427.
regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter during that small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.
One effect of this is storing values of some columns under a different
column.
This patch replaces all column_family::schema() accesses with accesses
to the _schema member, which is obtained once per compaction and is
the same schema which the readers use.
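Schematically (the writer-creation call shown is an illustrative assumption, not the actual API):
    // Before: re-reads the table's current schema, which may have changed
    // since the readers were created:
    //   auto writer = create_sstable_writer(*cf.schema(), sst);
    // After: use the schema captured once per compaction, shared with readers:
    auto writer = create_sstable_writer(*_schema, sst);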
Fixes #4304.
Tests:
- manual tests with hard-coded schema change injection to reproduce the bug
- build/dev/scylla boot
- tests/sstable_mutation_test
Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 58e7ad20eb)
A race condition takes place when one of the sstables selected by a
snapshot is deleted by compaction. The snapshot fails because it tries
to link an sstable that was previously unlinked by compaction's sstable
deletion.
Refs #4051.
(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
When bootstrapping, a node should wait to reach schema agreement
with its peers before it can join the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.
To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has two checks. First it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.
To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitly make sure that we have really talked to
*at least* one other node before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.
Fixes: #4196
Branches: 3.0, 2.3
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
(cherry picked from commit 2125e99531)
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().
Protect against that by using get_opt() and failing authentication if we see a NULL.
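A sketch of the safer lookup; the row API and error message are illustrative assumptions:
    // Before: get_as<sstring>() dereferences a disengaged optional when the
    // column is NULL. After: treat NULL as an authentication failure.
    auto salted_hash = row.get_opt<sstring>("salted_hash");
    if (!salted_hash) {
        throw exceptions::authentication_exception("Bad credentials");
    }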
Fixes #4168.
Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
(cherry picked from commit da9628c6dc)
"
The code reading counter cells from sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.
Fixes #4206
Tests: unit(release)
"
* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
tests/sstables: test counter cell header with large number of shards
sstables/counters: fix remote counter shard detection
(cherry picked from commit d2d885fb93)
"uuid" was ref:ed in a continuation. Works 99.9% of the time because
the continuation is not actually delayed (and assuming we begin the
checks with non-truncated (system) cf:s it works).
But if we do delay continuation, the resulting cf map will be
borked.
Fixes #4187.
Message-Id: <20190204141831.3387-1-calle@scylladb.com>
(cherry picked from commit 9cadbaa96f)
Change the test so that services are correctly torn down, in the
correct order (e.g., storage_service accesses the messaging_service
when stopping).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-2-duarte@scylladb.com>
(cherry picked from commit 495a92c5b6)
The original reference points to a thread-local storage object that is
guaranteed to outlive the continuation, but copying it makes the
subsequent calls point to a local object and introduces a
use-after-free bug.
Fixes #3948
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
(cherry picked from commit 68458148e7)
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.
One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.
This patch rearranges the code so that advance_and_await() provides the
strong exception guarantee.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>
Fixes #3931.
(cherry picked from commit 57e25fa0f8)
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.
Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.
But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletion. At
that point we should propagate that information to the monitor as well,
but we don't.
Fixes #3732 (hopefully)
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
(cherry picked from commit 9f403334c8)
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.
The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.
We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor
for each element, but we don't really expect the integer specialization
to do any of that.
However, to avoid confusing future developers who may feel tempted
to convert this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.
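The bug in a nutshell (illustrative values):
    std::vector<int64_t> buckets(4); // constructed with 4 zeroed elements
    buckets.reserve(4);              // no-op: capacity is already 4
    buckets.push_back(1);            // appends at index 4, not index 0
    buckets.push_back(2);            // -> {0, 0, 0, 0, 1, 2}
    // indices 0..3 stay zero; the parsed values land past them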
Fixes #3918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
(cherry picked from commit c6811bd877)
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). This broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.
Fixes #3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>
(cherry picked from commit 1ce52d5432)
Fixes #3798. Fixes #3694.
Tests:
unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)
* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
gms/gossiper: Replicate endpoint states in add_saved_endpoint()
gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
gms/gossiper: Always override states from older generations
(cherry picked from commit 48ebe6552c)
Int types in JSON will be serialized to int types in C++. They will then
only be able to handle 4 GB, and we tend to store more data than that.
Without this patch, listsnapshots is broken in all versions.
Fixes: #3845
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
(cherry picked from commit 98332de268)
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using the
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs, the CqlParser throws an exception which is in
turn processed for some special cases in Scylla to generate a custom
message. The default case, however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside Scylla,
in antlr headers that reside on the host machine OS.
Manually tested 2 test cases: a typo causing Scylla to crash, and
a CQL comment without a newline at the end, which also caused Scylla
to crash. Ran unit tests (release).
Fixes #3740. Fixes #3764.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
(cherry picked from commit 20f49566a2)
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).
However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.
Fix by adding the correct incantation to the file.
Fixes #3799.
Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
(cherry picked from commit aaab8a3f46)
When validating assignment between two types, it's possible one of
them is wrapped in a reversed_type if it comes, for example, from the
type associated with a clustering column. When checking for weak
assignment the types are correctly unwrapped, but not when checking
for an exact match, which this patch fixes.
Technically, the receiver is never a reversed_type for the current
callers, but this is the morally correct implementation, as the type
being reversed or not plays no role in assignment.
Tests: unit(release)
Fixes #3789
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-1-duarte@scylladb.com>
(cherry picked from commit 5e7bb20c8a)
We need to validate before calling query_options::prepare() whether
the set of prepared statement values sent in the query matches the
number of names we need to bind; otherwise we risk an out-of-bounds
access if the client also specified names together with the values.
Refs #3688
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814225607.14215-1-duarte@scylladb.com>
(cherry picked from commit 805ce6e019)
Currently, both scylla-housekeeping-daily/-restart services mistakenly
specify the repo file path as "@@REPOFILES@@", which is copied from the
.in template and needs to be replaced with the actual path.
Fixes #3776
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180921031605.9330-1-syuu@scylladb.com>
(cherry picked from commit 21a12aa458)
The non-TLS RPC server has an rpc::resource_limits configuration that limits
its memory consumption, but the TLS server does not. That means a many-node
TLS configuration can OOM if all nodes gang up on a single replica.
Fix by passing the limits to the TLS server too.
Fixes #3757.
Message-Id: <20180907192607.19802-1-avi@scylladb.com>
(cherry picked from commit 4553238653)
Secondary index queries do not work correctly when multiple
restrictions are present - the rest of the restrictions are simply
ignored, which results in too many rows returned to the client.
This 2.3 fix makes these unsafe queries return an error instead.
Refs #3754
Message-Id: <7e470052d8ffc5bd8dc12e0d7f2705f0754afdbb.1536243391.git.sarna@scylladb.com>
When measuring_output_stream is used to calculate a result element's
size, it incorrectly takes into account not only the serialized element
size, but also a placeholder that the
ser::qr_partition__rows/qr_partition__static_row__cells
constructors put in the beginning. Fix it by taking the starting point
in the stream before element serialization and subtracting it
afterwards.
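Conceptually (names here are assumptions):
    // Measure just the element by diffing the stream position, so the
    // constructor's placeholder bytes are not counted.
    auto start = out.size();
    serialize(out, element);
    auto element_size = out.size() - start;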
Fixes #3755
Message-Id: <20180906153609.GJ2326@scylladb.com>
(cherry picked from commit d7674288a9)
An incorrect column_kind was passed, which may cause the wrong type to
be used for comparison if the schema contains static columns. Affects
only tests.
Spotted during code review.
Message-Id: <1531144991-2658-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1336744a05)
The reloading flow may hold items in the underlying
loading_shared_values after they have been removed (e.g. via the
remove(key) API), so loading_shared_values.size() doesn't represent the
correct size of the loading_cache. lru_list.size(), on the other hand,
does.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 1e56c7dd58)
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.
This may create weird situations like this:
<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
    std::cout << e << std::endl;
}
<all 10 entries are printed, including the one for "key1">
In order to avoid such situations we are going to make
loading_cache::iterator a transform_iterator over lru_list::iterator
instead of loading_shared_values::iterator, because lru_list contains
entries only for cached items.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 945d26e4ee)
The code uses the incorrect output stream in case only a digest is
requested, and thus gets an incorrect data size. Failing to correctly
account for the static row size while calculating the digest may cause
a mismatch between the digest and data queries.
Fixes #3753.
Message-Id: <20180905131219.GD2326@scylladb.com>
(cherry picked from commit 98092353df)
Change the validity timeout from 1s to 1h in order to avoid false
alarms on busy systems: with a short value there is a chance that the
(loading_cache.size() == num_loaders) check is going to run after some
elements of the cache have already been evicted.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180904193026.7304-1-vladz@scylladb.com>
(cherry picked from commit dae70e1166)
Commit e664f9b0c6 transitioned internal
CQL queries in the auth. sub-system to be executed with finite time-outs
instead of infinite ones.
It should have also modified the functions in `auth/roles-metadata.cc`
to have finite time-outs.
This change fixes some previously failing dtests, particularly around
repair. Without this change, the QUORUM query fails to terminate when
the necessary consistency level cannot be achieved.
Fixes #3736.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>
(cherry picked from commit 682805b22c)
When a joining node announces its join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of other nodes,
which causes errors like the ones below:
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from 127.0.4.1#0:
std::runtime_error (sorted_tokens is empty in first_token_index!)
To fix, wait for the token range setup before announcing the join
status.
Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test
Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
(cherry picked from commit 89b769a073)
When test.py is run with --jenkins flag Boost UTF is asked to generate
an XML file with the test results. This automatically disables the
human-readable output printed to stdout. There is no real reason to do
so and it is actually less confusing when the Boost UTF messages are in
the test output together with Scylla logger messages.
Message-Id: <20180704172913.23462-1-pdziepak@scylladb.com>
(cherry picked from commit 07a429e837)
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
so we need to create it.
Fixes #3738
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
(cherry picked from commit bd8a5664b8)
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.
It should be, so that the accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.
Fixes #3625 (hopefully).
Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 4fb3f7e8eb)
"This series introduces a few improvements related to a reload flow.
From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.
It may also safely evict as many items from the cache as needed."
Fixes #3606
* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
utils::loading_cache: hold a shared_value_ptr to the value when we reload
utils::loading_cache::on_timer(): remove not needed capture of "this"
utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
(cherry picked from commit f6aadd8077)
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.
For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.
Fix this by using a temporary container to hold those entries to be
reloaded.
Spotted when reading the code.
Also use if constexpr and fix the comment in the function containing
the changes.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
(cherry picked from commit 63b63b0461)
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However, if _load()
invalidates the entry, the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.
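Roughly (entry/value names are assumptions):
    // Copy the key before _load() can invalidate the entry, and hand the
    // copy to both _load() and the continuation.
    auto key = entry.key();
    return _load(key).then([this, key] (value_type v) {
        on_loaded(key, std::move(v)); // safe: 'key' is our own copy
    });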
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
(cherry picked from commit 2e7bf9c6f9)
This error is transient, since as soon as the node is up we will be able
to send the migration request. Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).
The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.
Fixes #3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>
(cherry picked from commit 5792a59c96)
memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.
It can happen that regular is not under pressure, but system is. In
this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.
I observed writes to the system keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.
Fixes #3717.
Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 10f6b125c8)
There could be soft pressure, but the soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block on earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.
Meanwhile, each flush will take longer to complete. Due to scheduling
group isolation, the soft-pressure flusher will keep getting the CPU.
This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test
Fixes #3717
Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 2afce13967)
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.
The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.
I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.
The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.
This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1e50f85288)
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated, since the assumption
is that each value in the IN list corresponds to a different partition.
This results in the same row appearing in the result a number of times
corresponding to the duplication of the partition value.
Added queries to the IN restriction unit test and fixed a bad result
check.
Fixes #2837
Tests: queries as in the use case from the GitHub issue, in both
prepared and plain forms (using the Python driver); unit tests.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
* dist/ami/files/scylla-ami c7e5a70...b7db861 (2):
> scylla-ami-setup.service: run only on first startup
> Use fstab to mount RAID volume on every reboot
(cherry picked from commit 54ac334f4b)
Since the Linux system aborts booting when it fails to mount fstab
entries, the user may not be able to see an error message when we use
fstab to mount /var/lib/scylla on the AMI.
Instead of aborting boot, we can just refuse to start
scylla-server.service when the RAID volume is not mounted, using the
RequiresMountsFor directive of the systemd unit file.
See #3640
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180824185511.17557-1-syuu@scylladb.com>
(cherry picked from commit ff55e3c247)
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.
This is the code that was removed from some list operations:
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
    row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
...
auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);
Put it back, in the form of common code in the update_parameters class.
Fixes #3703
* https://github.com/duarten/scylla cql-list-fixes/v1:
tests/cql_query_test: Test multi-cell static list updates with ckeys
cql3/lists: Fix multi-cell static list updates in the presence of ckeys
keys: Add factory for an empty clustering_key_prefix_view
(cherry picked from commit 6937cc2d1c)
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called by
transport::request::read_options() leaves _values completely empty.
In query_options::prepare(), however, we were using _values to
associate values with the client-specified column names, and not
_value_views. Fix this by using _value_views instead.
As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.
Fixes #3688
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
(cherry picked from commit a4355fe7e7)
After ac27d1c93b, if a read executor has just enough targets to
achieve the request's CL and a connection to one of them is dropped
during execution, a ReadFailed error will be returned immediately and
the client will not have a chance to issue a speculative read (retry).
The patch changes the code to not return the ReadFailed error
immediately, but to wait for the timeout instead and give the client a
chance to issue a speculative read, in case the read executor does not
have additional targets to send speculative reads to by itself.
Fixes #3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
(cherry picked from commit 7277ee2939)
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.
The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.
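Schematically (the container and type names are illustrative):
    // Before: the cell is moved into a temporary first; if emplace_back()
    // then fails, the original value is already gone:
    //   _cells.emplace_back(cell_and_hash{std::move(c), std::move(h)});
    // After: with the new cell_and_hash constructor, construct in place:
    _cells.emplace_back(std::move(c), std::move(h));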
Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.
Introduced in 99a3e3a.
Fixes #3678.
Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 024b3c9fd9)
In a previous commit we moved debian/scylla-server.service to
debian/scylla-server.scylla-server.service to explicitly specify the
subpackage name, but that doesn't work for dh_installinit without the
'--name' option.
As a result, the current scylla-server .deb package is missing
scylla-server.service, so we need to rename the service back to the
original file name.
Fixes #3675
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810221944.24837-1-syuu@scylladb.com>
(cherry picked from commit f30b701872)
"
This series addresses SELECT/INSERT JSON support issues, namely
handling null values properly and parsing decimals from strings.
It also comes with updated cql tests.
Tests: unit (release)
"
Fixes #3666. Fixes #3664. Fixes #3667.
* 'json_fixes_3' of https://github.com/psarna/scylla:
cql3: remove superfluous null conversions in to_json_string
tests: update JSON cql tests
cql3: enable parsing decimal JSON values from string
cql3: add missing return for dead cells
cql3: simplify parsing optional JSON values
cql3: add handling null value in to_json
cql3: provide to_json_string for optional bytes argument
(cherry picked from commit 95677877c2)
When we use str.format() to pass variables into the message, it always
causes an exception like "KeyError: 'red'", since the message contains
color variables that are not passed to str.format().
To avoid the error we need to pass all format variables to colorprint()
and run str.format() inside the function.
Fixes #3649
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803015216.14328-1-syuu@scylladb.com>
(cherry picked from commit ad7bc313f7)
Currently scylla_ec2_check exits silently when the EC2 instance is
optimized for Scylla, so the result of the check is unclear; we need to
output a message.
Note that this change affects the AMI login prompt too.
Fixes #3655
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808024256.9601-1-syuu@scylladb.com>
(cherry picked from commit 15825d8bf1)
Since our scripts were converted to Python, we can no longer
source them from a shell. Execute them directly instead. Also,
we now need to import configuration variables ourselves, since
scylla_prepare, being an independent process, won't do it for
us.
Fixes #3647
Message-Id: <20180802153017.11112-1-avi@scylladb.com>
(cherry picked from commit c9caaa8e6e)
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.
This is consistent with the documentation of the function in its man
page.
As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".
The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):
Some implementations return `NULL` on failure, and others return an
_invalid_ hashed passphrase, which will begin with a `*` and will
not be the same as SALT.
Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.
With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.
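A sketch of the detection logic (simplified; the helper is hypothetical):
    // Both NULL and an invalid "*"-prefixed result from crypt_r() mean the
    // requested hashing algorithm is unsupported.
    static bool hash_ok(const char* hash) {
        return hash != nullptr && hash[0] != '*';
    }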
Fixes #3637.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
(cherry picked from commit fce10f2c6e)
"
There is an exception safety problem in imr::utils::object. If multiple
memory allocations are needed and one of them fails the main object is
going to be freed (as expected). However, at this stage it is not
constructed yet, so when LSA asks its migrator for the size it may get
a meaningless value. The solution is to remember the size until object
is fully created and use sized deallocation in case of failures.
Fixes #3618.
Tests: unit(release, debug/imr_test)
"
(cherry picked from commit 3b42fcfeb2)
Currently rpc::closed_error is not counted towards replica failure
during read and thus read operation waits for timeout even if one
of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.
Fixes #3590.
Message-Id: <20180708123522.GC28899@scylladb.com>
(cherry picked from commit ac27d1c93b)
The calculation consists of several parts with preemption points
between them, so a table can be added while the calculation is ongoing.
Do not assume that the table exists in the intermediate data structure.
Fixes #3636
Message-Id: <20180801093147.GD23569@scylladb.com>
(cherry picked from commit 44a6afad8c)
"
This series replaces infinite time-outs in internal distributed
(non-local) CQL queries with finite ones.
The implementation of tracing, which also performs internal queries,
already has finite time-outs, so it is unchanged.
Fixes #3603.
"
* 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla:
Use finite time-outs for internal auth. queries
Use finite query time-outs for `system_distributed`
(cherry picked from commit 620e950fc8)
mock outputs files owned by root. This causes attempts
by scripts that want to junk the working directory (typically
continuous integration) to fail on permission errors.
Fixup those permissions after the fact.
Message-Id: <20180719163553.5186-1-avi@scylladb.com>
(cherry picked from commit b167647bf6)
"
This mini-series covers a regression caused by newest versions
of jsoncpp library, which changed the way of quoting UTF-8 strings.
Tests: unit (release)
"
* 'add_json_quoting_3' of https://github.com/psarna/scylla:
tests: add JSON unit test
types: use value_to_quoted_string in JSON quoting
json: add value_to_quoted_string helper function
Ref #3622.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit d6ef74fe36)
Previously the CQL grammar wrongfully required INSERT JSON queries
to provide a list of columns, even though they are already
present in the JSON itself.
Unfortunately, tests were written with this false assumption as well,
so they are updated too.
Message-Id: <33b496cba523f0f27b6cbf5539a90b6feb20269e.1532514111.git.sarna@scylladb.com>
Fixes #3631.
(cherry picked from commit f66aace685)
"
The problem happens under the following circumstances:
- we have a partially populated partition in cache, with a gap in the middle
- a read with no clustering restrictions trying to populate that gap
- eviction of the entry for the lower bound of the gap concurrent with population
The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.
Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.
The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts before all
clustering rows. Otherwise, we're populating starting from _last_row
and should consult it.
Fixes #3608.
"
* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
row_cache: Fix violation of continuity on concurrent eviction and population
position_in_partition: Introduce is_before_all_clustered_rows()
(cherry picked from commit 31151cadd4)
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.
Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.
Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3f509ee3a2)
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuations captures via value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications made to the read
command object by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.
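In outline (variable names and the trimming helper are assumptions):
    // Snapshot the limits now: query_partition_key_range_concurrent()
    // mutates cmd->row_limit and cmd->partition_limit as it walks the vnodes.
    auto row_limit = cmd->row_limit;
    auto partition_limit = cmd->partition_limit;
    return query_partition_key_range_concurrent(cmd, ranges).then(
            [row_limit, partition_limit] (auto results) {
        return trim(std::move(results), row_limit, partition_limit);
    });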
Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes apparently reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring, thus causing the problem to surface.
Fixes #3605.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
(cherry picked from commit cc4acb6e26)
Since some AMIs use consistent network device naming, the primary NIC
ifname is not 'eth0'.
But we hardcoded the NIC name as 'eth0' in scylla_ec2_check, so we need
to add a --nic option to specify a custom NIC ifname.
Fixes #3584
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712142446.15909-1-syuu@scylladb.com>
(cherry picked from commit ee61660b76)
Drop scylla_lib.sh, since all bash scripts depending on the library
have already been converted to python3, and all scylla_lib.sh features
are implemented in scylla_util.py.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711114756.21823-1-syuu@scylladb.com>
(cherry picked from commit 58e6ad22b2)
"Converted more scripts to python3."
* 'script_python_conversion2_v2' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_util.py: make run()/out() functions shorter
dist/ami: install python34 to run scylla_install_ami
dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
dist/common/scripts: drop class concolor, use colorprint()
dist/ami/files/.bash_profile: convert almost all lines to python3
dist/common/scripts: convert node_exporter_install to python3
dist/common/scripts: convert scylla_stop to python3
dist/common/scripts: convert scylla_prepare to python3
(cherry picked from commit 693cf77022)
In the removenode operation, if message servicing is stopped, e.g. due
to disk I/O error isolation, the node can keep retrying the
REPLICATION_FINISHED verb infinitely.
A Scylla log full of messages like the following was observed:
[shard 0] storage_service - Fail to send REPLICATION_FINISHED to $IP:0:
seastar::rpc::closed_error (connection is closed)
To fix, limit the number of retries.
Tests: update_cluster_layout_tests.py
Fixes #3542
Message-Id: <638d392d6b39cc2dd2b175d7f000e7fb1d474f87.1529927816.git.asias@scylladb.com>
(cherry picked from commit bb4d361cf6)
drop_column_family now waits for both writes and reads in progress.
It solves possible liveness issues with row cache, when column_family
could be dropped prematurely, before the read request was finished.
The phaser operation is passed inside the database::query() call.
There are other places where reading logic is applied (e.g. view
replicas), but these are guarded with different synchronization
mechanisms, while _pending_reads_phaser applies to regular reads only.
Fixes #3357
Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <d58a5ee10596d0d62c765ee2114ac171b6f087d2.1529928323.git.sarna@scylladb.com>
(cherry picked from commit 03753cc431)
Currently sysconfig_parser.get() returns the parameter including double
quotes, which causes problems when appending text using
sysconfig_parser.set().
Fixes #3587
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180706172219.16859-1-syuu@scylladb.com>
(cherry picked from commit 929ba016ed)
In bash, a local variable declaration is a separate operation with its
own exit status (always 0), therefore constructs like
local var=`cmd`
will always result in the 0 exit status ($? value) regardless of the actual
result of "cmd" invocation.
To overcome this we should split the declaration and the assignment to be like this:
local var
var=`cmd`
Fixes #3508
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-3-git-send-email-vladz@scylladb.com>
(cherry picked from commit 7495c8e56d)
We broke build_ami.sh when we dropped Ubuntu support: the
scylla_current_repo command does not finish because of a missing
argument ('--target' with no distribution name, since $TARGET is always
blank now). It needs to be hardcoded as centos.
Fixes #3577
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705035251.29160-1-syuu@scylladb.com>
(cherry picked from commit 3bcc123000)
Use is_debian()/is_ubuntu() to detect target distribution, also install
pystache by path since package name is different between Fedora and
CentOS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703193224.4773-1-syuu@scylladb.com>
(cherry picked from commit 3cb7ddaf68)
Gentoo Linux was not supported by the node_health_check script
which resulted in the following error message displayed:
"This s a Non-Supported OS, Please Review the Support Matrix"
This patch adds support for Gentoo Linux while adding a TODO note
to add support for authenticated clusters which the script does
not support yet.
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
(cherry picked from commit 8c03c1e2ce)
"
In the same way that drivers can route requests to a coordinator that
is also a replica of the data used by the request, we can allow
drivers to route requests directly to the shard. This patchset
adds and documents a way for drivers to know which shard a connection
is connected to, and how to perform this routing.
"
* tag 'shard-info-alt/v1' of https://github.com/avikivity/scylla:
doc: documented protocol extension for exposing sharding
transport: expose more information about sharding via the OPTIONS/SUPPORTED messages
dht: add i_partitioner::sharding_ignore_msb()
(cherry picked from commit 33d7de0805)
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.
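Illustratively:
    // A singular range is not necessarily a single partition: it can be
    // built around a token, which many partition keys may share.
    if (query::is_single_partition(range)) {
        // single-partition read path
    }
    // rather than: if (range.is_singular()) { ... }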
Found during code-review.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
(cherry picked from commit 8084ce3a8e)
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł'
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since it wasn't covered by the unit
tests, the bug remained unnoticed until now.
This series fixes the memory usage calculation and adds proper unit
tests.
* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
tests/mutation: properly mark atomic_cells that are collection members
imr::utils::object: expose size overhead
data::cell: expose size overhead of external chunks
atomic_cell: add external chunks and overheads to
external_memory_usage()
tests/mutation: test external_memory_usage()
(cherry picked from commit 2ffb621271)
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not, it
checks whether the query is for singular partitions or a range scan,
to decide whether to enable stateful queries or not. This check
assumed that there is at least one range in _ranges, which will not
hold under some circumstances. Add a check for _ranges being empty.
Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
(cherry picked from commit 59a30f0684)
"
Added a NIC / disk existence check and a --force-raid mode to
scylla_raid_setup.
"
* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_raid_setup: verify specified disks are unused
dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
dist/common/scripts/scylla_sysconfig_setup: verify NIC existence
(cherry picked from commit a36b1f1967)
scylla_install_pkg was initially written for the one-liner installer,
but now it is only used for creating the AMI, and it is just a few
lines of code, so it should be merged into the scylla_install_ami
script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
(cherry picked from commit 084c824d12)
"
I found problems in the previously submitted patchsets 'scylla_setup
fixes' and 'more fixes for scylla_setup', so I fixed them and merged
them into one patchset.
Also added a few more patches.
"
* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
dist/common/scripts/scylla_raid_setup: fix module import
dist/common/scripts/scylla_setup: check disk is used in MDRAID
dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
dist/common/scripts/scylla_setup: use print() instead of logging.error()
dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
dist/common/scripts/scylla_coredump_setup: run os.remove() when the directory being deleted is a symlink
dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
dist/common/scripts/scylla_setup: add a disk to selected list correctly
dist/common/scripts/scylla_setup: fix wrong indent
dist/common/scripts: sync instance type list for detecting NIC type to the latest one
dist/common/scripts: verify systemd unit existence using 'systemctl cat'
(cherry picked from commit 0b148d0070)
"
If a coordinator sends write requests with ID=X and restarts, it may get a reply to
the request after it restarts and sends another request with the same ID (but to
different replicas). This condition will trigger an assert in a coordinator. Drop
the assertion in favor of a warning and initialize handler id in a way to make
this situation less likely.
Fixes: #3153
"
* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
storage_proxy: initialize write response id counter from wall clock value
storage_proxy: drop virtual from signal(gms::inet_address)
storage_proxy: do not assert on getting an unexpected write reply
(cherry picked from commit a45c3aa8c7)
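A hedged sketch of the initialization idea (names and granularity assumed): seeding the counter from the wall clock makes it unlikely that a restarted coordinator reuses an ID that is still in flight on a replica.
    #include <chrono>
    #include <cstdint>
    // Seed the write-response id counter from wall-clock time instead of
    // zero, so ids rarely collide across coordinator restarts.
    static uint64_t initial_response_id() {
        using namespace std::chrono;
        return duration_cast<milliseconds>(
                system_clock::now().time_since_epoch()).count();
    }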
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.
Fixes#3557.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
(cherry picked from commit 3194ce16b3)
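A sketch of the "-local" primary-range idea on a simplified ring model (types invented for illustration): instead of one global primary owner per token, each DC gets its own primary owner - the first node in ring order from the token that belongs to that DC.
    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    struct ring_node { int64_t token; std::string endpoint; std::string dc; };

    // Walk the ring starting at `token` and return the first node that
    // belongs to `dc`; `nodes` must be sorted by token.
    std::optional<std::string> primary_owner_in_dc(
            const std::vector<ring_node>& nodes, int64_t token,
            const std::string& dc) {
        size_t n = nodes.size();
        size_t start = 0;
        while (start < n && nodes[start].token < token) {
            ++start;
        }
        for (size_t i = 0; i < n; ++i) {
            const ring_node& node = nodes[(start + i) % n];
            if (node.dc == dc) {
                return node.endpoint;
            }
        }
        return std::nullopt;
    }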
Introduced in 5b59df3761.
It is incorrect to erase entries from the memtable being moved to
cache if the partition update can be preempted, because a later memtable
read may create a snapshot in the memtable before memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for the presence of snapshots when entering the partition, but
this condition may change if the update is preempted. The fix is to not
allow erasing if the update is preemptible.
This also caused SIGSEGVs because we were assuming that no such
snapshots will be created and hence were not invalidating iterators on
removal of the entries, which results in undefined behavior when such
snapshots are actually created.
Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test
Fixes#3532
Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b464b66e90)
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate the cache, which will appear as the partition deletion or
writes to the static row being lost, until node restart or eviction of
the partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes#3552. Fixes#3553.
"
* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
tests: Add test for slicing a mutation source with date tiered compaction strategy
tests: Check that database conforms to mutation source
database: Disable sstable filtering based on min/max clustering key components
(cherry picked from commit e1efda8b0c)
Fixes#3546
Both older Origin and Scylla write "known" compressor names (i.e. those
in the Origin namespace) unqualified (i.e. LZ4Compressor).
This behaviour was not preserved in the virtualization change, but it
probably should be.
Message-Id: <20180627110930.1619-1-calle@scylladb.com>
(cherry picked from commit 054514a47a)
"
Cache tracker is a thread-local global object that indirectly depends on
the lifetimes of other objects. In particular, a member of
cache_tracker, mutation_cleaner, may extend the lifetime of a
mutation_partition until the cleaner is destroyed. The
mutation_partition itself depends on LSA migrators, which are
thread-local objects. Since there is no direct dependency between
LSA migrators and cache_tracker, it is not guaranteed that the former
won't be destroyed before the latter. The easiest (barring some unit
tests that repeat the same code several billion times) solution is to
stop using globals.
This series also improves the part of LSA sanitiser that deals with
migrators.
Fixes#3526.
Tests: unit(release)
"
* tag 'deglobalise-cache-tracker/v1-rebased' of https://github.com/pdziepak/scylla:
mutation_cleaner: add disclaimer about mutation_partition lifetime
lsa: enhance sanitizer for migrators
lsa: formalise migrator id requirements
row_cache: deglobalise row cache tracker
This works around a problem of std::terminate() being called in a
debug-mode build if initialization of _current throws.
Backtrace:
Thread 2 "row_cache_test_" received signal SIGABRT, Aborted.
0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
#1 0x00007ffff17d077d in abort () from /lib64/libc.so.6
#2 0x00007ffff5773025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007ffff5770c16 in ?? () from /lib64/libstdc++.so.6
#4 0x00007ffff576fb19 in ?? () from /lib64/libstdc++.so.6
#5 0x00007ffff5770508 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6 0x00007ffff3ce4ee3 in ?? () from /lib64/libgcc_s.so.1
#7 0x00007ffff3ce570e in _Unwind_Resume () from /lib64/libgcc_s.so.1
#8 0x0000000003633602 in reader::reader (this=0x60e0001160c0, r=...) at flat_mutation_reader.cc:214
#9 0x0000000003655864 in std::make_unique<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (__args#0=...)
at /usr/include/c++/7/bits/unique_ptr.h:825
#10 0x0000000003649a63 in make_flat_mutation_reader<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (args#0=...)
at flat_mutation_reader.hh:440
#11 0x000000000363565d in make_forwardable (m=...) at flat_mutation_reader.cc:270
#12 0x000000000303f962 in memtable::make_flat_reader (this=0x61300001d540, s=..., range=..., slice=..., pc=..., trace_state_ptr=..., fwd=..., fwd_mr=...)
at memtable.cc:592
Message-Id: <1528792447-13336-1-git-send-email-tgrabiec@scylladb.com>
"
The read path on coordinator involves a lot of passing around buffers
and some occasional processing. We start with query::result obtained
from the storage_proxy which is then transformed into a
cql3::result_set, which is then used to write a response. Buffers are
copied and linearised quite excessively.
This series attempts to remedy that by using views of fragmented buffers
as much as possible. The first part deals with reading from
query::result. ser::buffer_view is introduced which enables the IDL
infrastructure to read a buffer without copying or linearising it.
The second part is switching native protocol layer to use bytes_ostream
instead of std::vector<char> to hold the generated response to the
client. The last part introduces cql3::result_generator which is an
alternative to cql3::result_set that passes buffer views without copying
or linearising anything from query::result to the native protocol layer
(or Thrift). It is only used in simple cases, when no processing at the
CQL layer is required, except for paged queries which require some
simple interpretation of the results and are supported by the result
generator.
Tests: unit(release), dtests(paging_test.py paging_additional_test.py
cql_additional_tests.py cql_tracing_test.py cql_prepared_test.py
cql_cast_test.py cql_tests.py)
"
* tag 'buffer-views-query-result/v2' of https://github.com/pdziepak/scylla: (34 commits)
cql3: select_statement: use fetch_page_generator() if possible
pager: add fetch_page_generator()
pager: make the visitor handle_result() accept a template parameter
pager: make query_result_visitor base class a template parameter
pager: make myvistor a member class of query_pager
pager: make shared pointers to selection constant
pager: merge query_pager and query_pagers::impl
cql3: select_statement: use result_generator if possible
cql3: selection: add is_trivial()
cql3: result: support result_generator
cql3: add lazy result_generator
cql3: add result class
cql3::result_set: fix encapsulation
thrift: use cql3::result_set visiting interface
transport: use cql3::result_set visiting interface
cql3::result_set: add visit()
transport: response: add write_int_placeholder()
transport: steal response buffers and make send zero-copy
transport: use reusable_buffer for compression
transport: response: use bytes_ostream
...
mutation_cleaner has already caused problems by extending the lifetime of
mutation_partition past the lifetime of the LSA migrators that it uses (due
to the fact that both the cleaner and migrators were thread-local
globals). Since the long-term goal is to make the mutation_partition
internal representation depend more and more on the schema, that lifetime
extension may again cause problems in the future, so let's add a
disclaimer that will hopefully help avoid them.
The current LSA sanitizer performs only basic checks on migrator use,
without doing any additional reporting in case an error is detected. This
patch enhances it so that when a problem is detected the relevant stack
traces get printed.
object_descriptor uses special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
The row cache tracker has numerous implicit dependencies on other objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both the cache tracker and some of those dependencies are thread-local
objects makes it hard to guarantee a correct destruction order.
Let's deglobalise the cache tracker and put it in the database class.
So far query_result_visitor was tied to result_set_builder. The goal is
to enable result_generator to work with paged queries as well so we need
to decouple them.
Shared pointers make code harder to reason about. It is not easy to get
rid of them in this piece of the code, but we can restore at least a bit
of sanity by adding consts.
There is just a single implementation of query_pager and there is no
reason to make anything virtual. Devirtualising this code will allow
higher layers to pass visitors via templates.
cql3::result can now hold either a result_set or a result_generator.
Some code that is not performance critical expects to get a result_set,
so a way of converting the result_generator to a result_set is added.
result_generator is a restricted alternative to result_set. It supports
only the simplest cases, but is much cheaper as it passes data almost
directly from query::result to its visitor bypassing much of the CQL
layer.
So far the only way of returning the result of a CQL query was to build a
result_set. An alternative lazy result generator is going to be
introduced for the simple cases when no transformations at the CQL layer
are needed. To do that we need to hide from the users the fact that there
are going to be multiple representations of CQL results.
This visiting interface for result_set satisfies most of its users (at
least all of those which are in the hot path). It will allow having an
alternative to result_set (i.e. the lazy result generator) which would
provide exactly the same interface.
This allows the response writer to defer writing integers until a later
time. It will be used by the lazy response generator which will know the
number of rows in the response only after they are all written.
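A toy illustration of the placeholder technique with a plain byte vector (the real code uses bytes_ostream and its own helpers): reserve the bytes for the integer up front, then patch them once the count is known.
    #include <cstdint>
    #include <cstring>
    #include <vector>
    // Toy: reserve 4 bytes for a big-endian row count, patch them once
    // the rows have been written and counted.
    std::vector<char> write_rows_with_count() {
        std::vector<char> buf;
        size_t hole = buf.size();
        buf.insert(buf.end(), 4, 0);     // placeholder for the count
        uint32_t rows = 0;
        // ... append serialized rows here, incrementing rows ...
        uint32_t be = __builtin_bswap32(rows);
        std::memcpy(buf.data() + hole, &be, sizeof(be));
        return buf;
    }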
Compression algorithms require us to linearise bytes_ostream. This may
cause an excessive number of large allocations. Using reusable_buffers
can avoid that.
std::vector<char> is not a very good container for incrementally
building a response. It may cause excessive copies and allocations. If
the response is large it will put more pressure on the memory allocator
by requiring the buffer to be contiguous.
We already have bytes_ostream which avoids all of these problems, so
let's use it.
So far cql_server::response was passed around using shared pointers.
They come at the significant cost of making the code hard to reason
about. All that is not necessary, and we can easily switch to using the
much more sensible std::unique_ptr.
There are some other translation units which right now are satisfied
with the response being an incomplete type. This means that
std::unique_ptr can't be used for it. Let's move the class declaration
to a header that can be included where needed.
This commit adds a helper class reusable_buffer which can be used to
avoid excessive memory allocations of large buffers when bytes_ostream
needs to be linearised. The idea is that reusable_buffer in most cases
is going to be thread local so that multiple continuation chains can
reuse the same large buffer.
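A minimal sketch of the idea, heavily simplified relative to the real class: one thread-local buffer that only ever grows and is handed out for each linearisation, so large temporary allocations are not repeated.
    #include <cstddef>
    #include <vector>
    // Simplified reusable buffer: grows to the largest size requested so
    // far and hands out the same storage on every call.
    struct reusable_buffer {
        std::vector<char> storage;
        char* get(size_t size) {
            if (storage.size() < size) {
                storage.resize(size);
            }
            return storage.data();
        }
    };
    thread_local reusable_buffer linearisation_buffer;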
query::result_view already operates on views of a serialised
query::result. However, until now the value of a cell was always
linearised and copied. This patch makes use of ser::buffer_view to avoid
that.
ser::buffer_view is a view of a fragmented buffer in a stream of
IDL-serialised data. It can be used to deserialise IDL objects without
needless copying and linearisation of large blobs.
"
Converted all setup scripts from bash to python3.
"
* 'scripts_python_conversion_v1' of https://github.com/syuu1228/scylla:
dist/common/scripts: convert scylla_kernel_check to python3
dist/common/scripts: convert scylla_ec2_check to python3
dist/common/scripts: convert scylla_sysconfig_setup to python3
dist/common/scripts: convert scylla_setup to python3
dist/common/scripts: convert scylla_selinux_setup to python3
dist/common/scripts: convert scylla_raid_setup to python3
dist/common/scripts: convert scylla_ntp_setup to python3
dist/common/scripts: convert scylla_fstrim_setup to python3
dist/common/scripts: convert scylla_dev_mode_setup to python3
dist/common/scripts: convert scylla_cpuset_setup to python3
dist/common/scripts: convert scylla_cpuscaling_setup to python3
dist/common/scripts: convert scylla_coredump_setup to python3
dist/common/scripts: convert scylla_bootparam_setup to python3
dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".
An alias is kept to avoid huge code churn.
To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.
Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
This patchset brings support for writing range tombstones to SSTables
3.x. ('mc' format).
In SSTables 3.x, range tombstones are represented by so-called range
tombstone markers (hereafter RT markers) that denote range tombstone
start and end bounds. So each range tombstone is represented in data
file by two ordered RT markers.
There are also markers that both close the previous range tombstone and
open the new one in case two range tombstones are adjacent. This is
done to consume less disk space on such occasions.
Range tombstones written as RT markers are naturally non-overlapping.
* github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6
range_tombstone_stream: Remove an unused boolean flag.
Revert "Add missing enum values to bound_kind."
sstables: Move to_deletion_time helper up and make it static.
sstables: Write end-of-partition byte before flushing the last index
block.
sstables: Add support for writing range tombstones in SSTables 3.x
format.
tests: Add unit test covering simple range tombstone.
tests: Add unit test covering adjacent range tombstones.
tests: Add test to cover non-adjacent RTs.
tests: Add test covering mixed rows and range tombstones.
tests: Add test covering SSTables 3.x with many RTs.
tests: Add unit test covering overlapping RTs and rows.
tests: Add tests writing a range tombstone and a row overlapping with
its start.
tests: Add tests writing a range tombstone and a row overlapping with
its end.
tests: Add function that writes from multiple memtable into SSTables.
tests: Add test where 2nd range tombstone covers the remainder of the
1st one.
tests: Add test writing two non-adjacent range tombstones with same
clustering key prefix at their bounds.
tests: Add test covering overlapped range tombstones.
"
This series addresses issue #3516 and enhances space watchdog to make it
device-aware. It's needed because, since the last MV-related changes, the
space watchdog can be responsible for multiple hints managers, which means
multiple directories, which may mean multiple devices.
Hence, having a single static space size limit is not enough anymore,
and the watchdog should take into account that different managers
may work on different disks, while other managers can share
the same device.
Tests: unit (release)
"
* 'enhance_space_watchdog_4' of https://github.com/psarna/scylla:
hints: reserve more space for dedicated storage
hints: add is_mountpoint function
hints: make space_watchdog device-aware
hints: add device_id to manager
hints: add get_device_id function
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But if the hints directory is mounted on dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether storage is 'dedicated' or not is determined by a simple check
of whether the given hints directory is a mount point.
Fixes#3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
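A hedged sketch of such a mount-point check (error handling omitted, and the assumption is the classic one: a directory is a mount point when it resides on a different device than its parent, which stat(2)'s st_dev exposes):
    #include <string>
    #include <sys/stat.h>
    // A directory is a mount point if it lives on a different device
    // (st_dev) than its parent directory.
    bool is_mountpoint(const std::string& path) {
        struct stat self{}, parent{};
        ::stat(path.c_str(), &self);
        ::stat((path + "/..").c_str(), &parent);
        return self.st_dev != parent.st_dev;
    }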
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.
References #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
In order to make space_watchdog device-aware, a device_id field
is added to the hints manager. It's an equivalent of stat.st_dev,
and it identifies the disk that contains the manager's root directory.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
In order to distinguish which directories reside on which devices,
a get_device_id function is added to the resource manager.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
To port the setup scripts to python3, the following utility functions/classes
are introduced:
- run(): execute a command line, returns the return code
- out(): execute a command line, returns stdout as a string
- is_debian_variant() / is_redhat_variant() / is_gentoo_variant()
/ is_ec2() / is_systemd(): detect a specific environment
- hex2list(): implement the hex2list.py code as a function
- makedirs(): same as os.makedirs() but does nothing when the dir already exists
- dist_name() / dist_ver(): alias of platform.dist()
- class systemd_unit: a utility to control a systemd unit using systemctl
- class sysconfig_parser: reader/writer of /etc/sysconfig files
- class concolor: a list of ANSI color escape sequences
Very often people use the issue tracker to just ask questions. We have
been telling them to close the bug and move the discussion somewhere
else but it would be better if people were already directed to the right
place before they even get it wrong.
This would be easier for everybody.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180621135051.3254-1-glauber@scylladb.com>
This comes in handy when we want to test overlapping range tombstones
because memtable would otherwise de-overlap them internally.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Tests three cases:
- a row lying inside a range tombstone
- a row that has the same clustering key as range tombstone start
- a row that has the same clustering key as range tombstone end
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
These are two RTs where one's end bound has the same clustering as the
other one's start bound, but both bounds are exclusive.
In this case those bounds should not (and cannot) be merged into a
single RT boundary when writing RT markers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For SSTables 3.x. ('mc' format), range tombstones are represented by
their bounds that are written to the data file as so-called RT markers.
For adjacent range tombstones, an RT marker can be of a 'boundary' type
which means it closes the previous range tombstone and opens the new
one.
Internally, sstable_writer_m relies on range_tombstone_stream to both
de-overlap incoming range tombstones and order them so that when they
are drained they can be easily thought of as just pairs of their bounds.
By default the Scylla Docker image runs without security features.
This patch adds support for the user to supply different parameter values
for the authenticator and authorizer classes, allowing a secure Scylla
setup in Docker.
For example if you want to run a secure Scylla with password and authorization:
docker run --name some-scylla -d scylladb/scylla --authenticator
PasswordAuthenticator --authorizer CassandraAuthorizer
Update the Docker documentation with the new command line options.
Signed-off-by: Noam Hasson <noam@scylladb.com>
Message-Id: <20180620122340.30394-1-noam@scylladb.com>
With the current .bash_profile, "Constructing RAID volume..." is printed while
scylla_ami_setup is still running, even when it is running on unsupported
instance types.
To avoid that we need to run the instance type check first; only then can
we run the rest of the script.
Fixes#2739
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613111539.30517-1-syuu@scylladb.com>
"
Make sure we properly handle row marker and row tombstone
when reading a row.
Tests: unit {release}
"
* 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev:
sstable: consume row marker in data_consume_rows_context_m
sstable: Add consumer_m::consume_row_marker_and_tombstone
sstable: add is_set and to_row_marker to liveness_info
* https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6:
cql3::query_options: add get_names() method
tracing::trace_state: hide the internals of params_values
tracing: store queries statements for BATCH
tracing: store the prepared statements parameters values
"
A few fixes for problems in scripts that were found when debugging #3508.
This series fixes that issue.
"
Fixes#3508
* 'ami_scripts_fixes-v1' of https://github.com/vladzcloudius/scylla:
scylla_io_setup: properly define the disk_properties YAML hierarchy
scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
scylla_io_setup: print the io_properties.yaml file name and not its handle info
scylla_lib.sh: tolerate perftune.py errors
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum it will keep
growing without bounds.
It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"
* 'tame-controller' of github.com:glommer/scylla:
controller: do not increase shares of controllers for inputs higher than the maximum
controller: adjust constants for compaction controller
"
This mini series fixes some querier-cache related issues discovered
while working on stateful range-scans.
1) A problem in the memory based cache eviction test that is as yet
unexposed (#3529).
2) Possible usage of invalidated iterators in querier_cache (#3424).
3) lookup() possibly returning a querier with the wrong read range
(#3530).
Tests: unit(release)
"
* 'fix-querier-cache-invalid-iterators-master' of https://github.com/denesb/scylla:
querier: find_querier(): return end() when no querier matches the range
querier_cache: restructure entries storage
tests/querier_cache: fix memory based eviction test
batch_statement::verify_batch_size() verifies that the total size of
mutations generated by the batch statement is smaller than certain
configurable thresholds. This is done by a custom mutation_partition
visitor, which violates atomic_cell_view::value() preconditions by
calling it even for dead cells.
The simplest solution is to use
mutation_partition::external_memory_usage() instead.
Message-Id: <20180619131405.12601-1-pdziepak@scylladb.com>
When dropping a table, wait for the column family to quiesce so that
no pending writes compete with the truncate operation, possibly
allowing data to be left on disk.
Fixes#2562
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180618193134.31971-1-duarte@scylladb.com>
Patch f39891a999 fixed #3443,
but also introduced a regression in dtest - a new column
was unconditionally added to the view during ALTER TABLE ADD,
while that should only be the case for "include all columns" views.
This patch fixes the regression (spotted by query_new_column_test).
References #3443
Message-Id: <7410d965255a514d78cf0ce941a3236b9d8ddbbd.1529399135.git.sarna@scylladb.com>
When none of the queriers found for the lookup key match the lookup
range `_entries.end()` should be returned as the search failed. Instead
the iterator returned from the failed `std::find_if()` is returned
which, if the find failed, will be the end iterator returned by the
previous call to `_entries.equal_range()`. This is incorrect because as
long as `equal_range()`'s end iterator is not also `_entries.end()` the
search will always return an iterator to a querier regardless of whether
any of them actually matches the read range.
Fix by returning `_entries.end()` when it is detected that no queriers
match the range.
Fixes: #3530
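The shape of the fix, as a hedged sketch over a hypothetical multimap-like `_entries` and a hypothetical `matches_read_range` predicate:
    #include <algorithm>
    // On a failed find, return the container's end(), not the end
    // iterator that equal_range() happened to produce.
    template <typename Entries, typename Pred>
    auto find_querier(Entries& _entries,
                      const typename Entries::key_type& key,
                      Pred matches_read_range) {
        auto [begin, end] = _entries.equal_range(key);
        auto it = std::find_if(begin, end, matches_read_range);
        return it == end ? _entries.end() : it;
    }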
Currently querier_cache uses a `std::unordered_map<utils::UUID, querier>`
to store cache entries and an `std::list<meta_entry>` to store meta
information about the querier entries, like insertion order, expiry
time, etc.
All cache eviction algorithms use the meta-entry list to evict entries
in reverse insertion order (LRU order). To make this possible
meta-entries keep an iterator into the entry map so that given a
meta-entry one can easily erase the querier entry. This however poses a
problem as std::unordered_map can possibly invalidate all its iterators
when new items are inserted. This is use-after-free waiting to happen.
Another disadvantage of the current solution is that it requires the
meta-entry to use a weak pointer to the querier entry so that in case
the entry is removed (as a result of a successful lookup) it doesn't try to
access it. This has an impact on all cache eviction algorithms as they
have to be prepared to deal with stale meta-entries. Stale meta-entries
also unnecessarily consume memory.
To solve these problems redesign how querier_cache stores entries
completely. Instead of storing the entries in an `std::unordered_map`
and storing the meta-entries in an `std::list`, store the entries in an
`std::list` and an intrusive-map (index) for lookups. This new design
has several advantages over the old one:
* The entries will now be in insert order, so eviction strategies can
work on the entry list itself, no need to involve additional data
structures for this.
* All data related to an entry is stored in one place, no data
duplication.
* Removing an entry automatically removes it from the index, as intrusive
containers support auto-unlink. This means there is no need to store
iterators long-term, risking use-after-free when the container
invalidates its iterators.
Additional changes:
* Modify eviction strategies so that they work with the `entry`
interface rather than the stored value directly.
Ref #3424
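A condensed sketch of the new layout (types simplified, an int standing in for utils::UUID): the std::list owns the entries in insertion order, and a non-owning boost::intrusive index with auto-unlink hooks serves lookups, so erasing an entry from the list automatically drops it from the index.
    #include <list>
    #include <boost/intrusive/set.hpp>
    namespace bi = boost::intrusive;

    using index_hook = bi::set_member_hook<bi::link_mode<bi::auto_unlink>>;

    struct entry {
        int key;           // stands in for utils::UUID
        // querier q;      // the cached querier would live here too
        index_hook link;   // unlinks itself from the index on destruction
        friend bool operator<(const entry& a, const entry& b) {
            return a.key < b.key;
        }
    };

    // auto_unlink hooks require constant_time_size<false>.
    using entry_index = bi::multiset<entry,
            bi::member_hook<entry, index_hook, &entry::link>,
            bi::constant_time_size<false>>;

    std::list<entry> entries;  // insertion (LRU) order; owns the entries
    entry_index index;         // non-owning lookup index into `entries`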
Increment the key counter after inserting the first querier into the
cache. Otherwise two queriers with the same key will be inserted and
will fail the test. This problem is exposed by the changes the next
patches make to the querier-cache, but is fixed beforehand to maintain
bisectability of the code.
Fixes: #3529
This is to stay compliant with Origin for SSTables 3.x.
It differs from SSTables 2.x (ka/la), where the last promoted
index block is pushed first and the end-of-partition byte is written
after it.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Right now there is no limit to how much the shares of the controllers
can grow. That is not a big problem for the memtable flush controller,
since it has a natural maximum in the dirty limit.
But the compaction controller, the way it's written today, can grow
forever and end up with a very large value for shares. We'll cap that at
adjust() time by not allowing shares to grow indefinitely.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the controller adjusts its shares based on how big the backlog
is in comparison to shard memory. We have seen in some tests that if the
dataset becomes too big, this may cause compactions to dominate.
While we may change the input altogether in future versions, I'd like to
propose a quick change for the time being: move the high point from 10x
memory size to 30x memory size. This will cause compactions to increase
in shares more slowly.
While this is as magic as the 10 before, it will allow us to err on the
side of caution, with compactions not becoming aggressive enough to
overly disrupt workloads.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This patchset runs the protocol servers under the "statement" scheduling
group, and makes all execution_stages in that path scheduling aware.
I used inheriting_concrete_execution_stage instead of passing the
scheduling group to concrete_execution_stage's constructor for two
reasons:
1. For cql statements, there is no easily accessible object that
can host the concrete_execution_stage and be reached from both
main.cc and the statements,
2. In the future, we will want to assign users to different
scheduling_groups, thus providing performance isolation for
service-level agreements (SLAs). Using an inheriting
execution_stage allows us to make the scheduling_group decision
in one place.
Depends on two unmerged patches in seastar, one fixing
inheriting_concrete_execution_stage compilation with reference parameters,
and one making smp::submit_to() scheduling aware.
"
* tag 'cql-sched/v1' of https://github.com/avikivity/scylla:
cql: make modification_statement execution_stage scheduling aware
cql: make batch_statement execution_stage scheduling aware
cql: make select_statement execution_stage scheduling aware
transport: make native protocol request processing execution_stage scheduling aware
main: start client protocol servers under the statement scheduling group
* seastar e7275e4...6422ece (7):
> build: enable concepts whenever they are supported by compiler
> shared_ptr: Enable releasing ownership of the object stored in lw_shared_ptr
> reactor: change way of calculating task quota violations
> Merge "Add metrics for steal time and task quota violations" from Glauber
> bitops.hh/log2ceil(): add special case for n == 1
> circular_buffer: add clear()
> build: add core/execution_stage.{cc,hh} to core_files
"
Tests: unit (release)
Before merging the LCS controller, we merged patches that would
guarantee that LCS would move towards zero backlog - otherwise the
backlog could get too high.
We didn't do the same for STCS, our first controlled strategy. So we may
end up with a situation where there are many SSTables inducing a large
backlog, but they are not yet meeting the minimum criteria for
compaction. The backlog, then, never goes down.
This patch changes the SSTable selection criteria so that if there is
nothing to do, we'll keep pushing towards reaching a state of zero
backlog. Very similar to what we did for LCS.
"
* 'stcs-min-threshold-v4' of github.com:glommer/scylla:
STCS: bypass min_threshold unless configured to enforce it strictly
compaction_strategy: allow the user to tell us if min_threshold has to be strict
If we fail to produce a SizeTiered compaction with the configured
min_threshold, we can try again to compact any two - unless there is a
global bypass telling us not to.
This will still favor doing larger compactions in size buckets where
that is possible, but if we are idle we will try to compact any two.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.
But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.
This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
Implement and test support for reading counters in SSTables 3.
"
* 'haaawk/sstables3/read-counters-v2' of ssh://github.com/scylladb/seastar-dev:
sstable_3_x_test: add test for counters
data_consume_rows_context_m: support reading counters
Add consumer_m::consume_counter_column
Extract make_counter_cell
row.hh & mp_row_consumer.hh: Add required includes
Use serialization_header::adjust in read_statistics
sstables 3: add serialization_header::adjust
data_consume_rows_context_m: add is_column_counter
data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
column_translation: add is_counter
Currently the SSTable test is failing (at least for me and Raphael),
complaining about the file it tries to write already existing. We have
helpers now to generate temporary directories, so we should use them.
The test passes after that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180614210036.16662-1-glauber@scylladb.com>
In SSTables 3, min timestamp and min deletion time in serialization
header are not stored directly; instead, the difference between
their value and the Cassandra "epoch" is stored.
This is supposed to make SSTables smaller. As a consequence, we have
to add the "epoch" after reading the values to obtain the actual
values of min timestamp and min deletion time.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
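A hedged sketch of the adjustment; the constants assume Cassandra's EncodingStats epoch of 2015-09-22 00:00:00 UTC (1442880000 seconds), which should be verified against the format spec:
    #include <cstdint>
    // Stored values are deltas from fixed epochs; reading adds them back.
    constexpr int64_t timestamp_epoch_us = 1442880000LL * 1000000; // microseconds
    constexpr int64_t deletion_time_epoch_s = 1442880000;          // seconds

    int64_t adjust_min_timestamp(int64_t stored) {
        return stored + timestamp_epoch_us;
    }
    int64_t adjust_min_deletion_time(int64_t stored) {
        return stored + deletion_time_epoch_s;
    }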
Memtable entries should be cleaned using memtable cleaner, which
unlike the cache's cleaner, is not associated with the cache
tracker. It's an error to clean a snapshot using a tracker which doesn't
own the entries. This will corrupt cache tracker's row counter.
Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.
Introduce in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. But it's likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.
Tests: unit (release)
Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
"
This series adds the following datetime functions to CQL:
- currentTimestamp
- currentDate
- currentTime
- currentTimeUUID
- timeUUIDToDate
- timestampToDate
- timeUUIDToTimestamp
- dateToTimestamp
- timeUUIDToUnixTimestamp
- timestampToUnixTimestamp
- dateToUnixTimestamp
It also comes with datetime conversions test added to cql_query_test.
Note: issue #2949 also mentioned queries like:
$ SELECT * FROM myTable WHERE date >= currentDate() - 2d;
but it's a broader topic of supporting arithmetic operations in general,
so it's moved to #3499.
Tests: unit (release)
"
* 'support_datetime_functions_3' of https://github.com/psarna/scylla:
tests: add datetime conversions to cql_query_tests
cql3: add time conversion functions
cql3: add current* time functions
types: add time_native_type
CentOS 7.4 supports using ambient capabilities in the systemd unit
file, but some other RHEL7-compatible environments don't, which causes
Scylla startup failure.
To avoid the issue, move the AmbientCapabilities line to
/etc/systemd/system/scylla.server.service.d/, and install the .conf only when
both systemd and the kernel support the feature.
Fixes#3486
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613232327.7839-1-syuu@scylladb.com>
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.
This degrades scan performance of Leveled Compaction Strategy tables.
Fixes#3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>
"
After issue 3501 it turned out that IDL generates incorrect
serialization code for fragmented buffers. This series addresses
the problem by:
* providing serialization code for FragmentRange
* changing IDL generation rules for fragmented buffers, so they
expect a lower layer to iterate over fragments
* adding a test to cql_query_test suite that covers #3501
* adding a test to idl_tests suite that covers fragmented serialization
"
* 'fix_fragmented_serialization_3' of https://github.com/psarna/scylla:
tests: add fragmented serialization test to idl_tests
tests: add long text value test
idl: remove for_each from fragmented serialization
serializer: add FragmentRange serialization
Previously fragmented buffers of bytes were serialized
with a for_each loop. Since serializing bytes involves writing
size first and then data, only the first fragment (and its size)
would be taken into account.
This commit changes the fragmented code generation so it expects
the serialized range to have a serialize(output, T) specification
and expects it to iterate over the fragments on its own (just like
the serializer for basic_value_view does).
Fixes#3501
Serialization for FragmentRange classes is added to the serialization
suite. It first serializes the total length into a 32-bit field and then
writes each fragment to the output.
References #3501
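A hedged sketch of that scheme over a generic output stream and fragment range (both hypothetical interfaces): the total size is written once, then every fragment's bytes, so multi-fragment values round-trip intact.
    #include <cstdint>
    // Write the total length as a 32-bit field, then each fragment's data.
    template <typename Output, typename FragmentRange>
    void serialize_fragments(Output& out, const FragmentRange& fragments) {
        uint32_t total = 0;
        for (auto&& frag : fragments) {
            total += frag.size();
        }
        out.write(reinterpret_cast<const char*>(&total), sizeof(total));
        for (auto&& frag : fragments) {
            out.write(frag.data(), frag.size());
        }
    }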
disk_properties map should be an entry in the 'disk' list hierarchy.
Currently this list is going to contain a single element.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
In order to get a file name from the given file() handle one should use
the file_handle.name property.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
When we check the currently configured tuning mode, perftune.py is allowed
to return errors. get_tune_mode() has to be able to tolerate them.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Store the prepared statement positional parameters values in the
corresponding system_traces.sessions entry in the 'parameters' column
(which has a map<text,text> type).
Parameters are stored as a pair of "param[X]" : "value", where X is
the index of the parameter starting from 0 and the "value" is the first
64 characters of the parameter's value string representation.
If parameters were given with their names attached (see the description
on bit 0x40 of QUERY flags in the CQL binary protocol specification) then
parameters are going to be stored in the "param[X](<bound variable name>)" : "value"
form.
If the value's string representation is longer than 64 characters, the "value" will
contain only its first 64 characters and will have "..." appended at
the end.
For a BATCH of prepared statements the parameter "name" will have a form of
param[Y][X] where Y is the index of the corresponding prepared statement
in the BATCH and X is the index of the parameter. Both X and Y start from
0.
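A hypothetical illustration of the resulting 'parameters' map entries (values invented), first for a single prepared statement, then for a BATCH of two prepared statements:
    { 'param[0]' : 'some-key', 'param[1](name)' : 'a-string-value-longer-than-sixty-four-characters-gets-cut-off-li...' }
    { 'param[0][0]' : 'key-a', 'param[0][1]' : '42', 'param[1][0]' : 'key-b' }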
Note:
Had to switch to boost::range::find() in sstables::big_sstable_set in order to
address the "ambiguous overload" compilation error.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Similarly to a regular QUERY or EXECUTE, we want to see the actual
query statements that were part of the BATCH.
If a traced query has only a single statement to execute, its statement
will be stored in the form 'query':'<statement>'.
If there are two or more queries (BATCH), the statement of each query in
the BATCH will be stored in the form 'query[X]':'<statement>', where X is
the index of the query in the BATCH, starting from 0.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hide it inside the trace_state.cc in order to avoid future circular
dependencies with other .hh files.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
Many components limit their internal memory pools/caches/queues depending
on the amount of memory present in the system. Each of them uses the seastar
memory interface to get the information about memory availability,
which makes it harder to 1) test the components with various memory
configurations and 2) see which components reserve memory and how
much each one reserves.
The patch changes all the components that rely on the memory size to get this
information through a configuration parameter during creation instead of
checking it directly with seastar, so only main interacts with the seastar
allocator.
"
* 'gleb/memory-config-v2' of github.com:scylladb/seastar-dev:
Provide available memory size to compaction_manager object during creation
Configure authorized_prepared_statment_cache memory limit during object creation
Configure logalloc memory size during initialization
Provide cql max request limit to cql server object during creation
Configure query result memory limiter size limit during object creation
Configure querier_cache size limit during object creation
Provide available memory size to messaging_service object during creation
Provide available memory size to hinted handoff resource manager during creation
Provide available memory size to storage_proxy object during creation
Provide available memory size to commitlog during creation
Provide available memory size to database object during creation
Configure prepared_statements_cache memory limit from outside
The test uses random mutations. We saw it failing with bad_alloc from time to time.
Reduce concurrency to reduce memory footprint.
Message-Id: <20180611090304.16681-1-tgrabiec@scylladb.com>
It compares only timestamps, but it should use intrinsic ordering of
the tombstone, which takes deletion time into consideration as well.
If we have two range tombstones with the same timestamp but different
deletion time (odd case, but still), then the one with the higher
deletion time should win. That's what all other parts of the system
use to resolve merges, in particular range_tombstone_list and
compact_mutation_state (the fragment stream compactor).
Not respecting this ordering violates the following equality:
do_compact(do_compact(m1) + m2) == do_compact(m1 + m2)
which may result in some clustered rows being missing in the
right-hand side, but not in the left-hand side, due to differences in
range tombstones.
This impacts only tests currently.
Message-Id: <1528705602-7218-1-git-send-email-tgrabiec@scylladb.com>
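A minimal sketch of the intrinsic ordering in question (field names assumed to mirror the tombstone struct): timestamps compare first, and deletion time breaks ties, so merge resolution agrees everywhere.
    #include <cstdint>
    #include <tuple>
    // Intrinsic tombstone ordering: the newer timestamp wins; on a
    // timestamp tie, the higher deletion time wins.
    struct tombstone_like {
        int64_t timestamp;
        int64_t deletion_time;
    };
    bool operator<(const tombstone_like& a, const tombstone_like& b) {
        return std::tie(a.timestamp, a.deletion_time)
             < std::tie(b.timestamp, b.deletion_time);
    }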
When I came across db/legacy_schema_migrator.cc, I had no idea what it
does and though I had obvious guesses (it somehow migrates old schemas,
right?) I didn't know what it really does. So after I figured this out,
I wrote this comment so the next person doesn't need to guess.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180605120225.25173-1-nyh@scylladb.com>
* 'systemd-coredump-debian9' of https://github.com/syuu1228/scylla:
dist/debian: fix pystache package name on Debian / Ubuntu
dist/debian: switch to systemd-coredump on Debian 9
dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
It is faster than gossiper::is_normal because it avoids doing a search in
the std::map<application_state, versioned_value>. It is useful for the
code in the fast path which needs to query if a node is in NORMAL
status.
Fixes#3500
Message-Id: <42db91fa4108f9f4fcf94fed3ec403ccf35d15e9.1528354644.git.asias@scylladb.com>
"
Implement and test support for reading collections in SSTables 3.
Tests: unit {release}
"
* 'haaawk/sstables3/read-collections-v1' of ssh://github.com/scylladb/seastar-dev:
sstables 3: Add tests for reading collections
flat_mutation_reader_assertions: add more flexible asserts
data_consume_rows_context_m: add support for collections
mp_row_consumer_m: Add support for collections
data_consume_rows_context_m: introduce cell_path
Use column_translation::*_is_collection in reading
column_translation: add *_column_is_collection()
column_flags_m: add HAS_COMPLEX_DELETION
Use read_unsigned_vint_length_bytes for COLUMN_VALUE
Use read_unsigned_vint_length_bytes for CK_BLOCKS
Implement read_unsigned_vint_length_bytes
"
This series is for nodetool getsstables.
This patch is based on:
8daaf9833a
With some minor adjustments because of the code change in sstables.
The idea is to allow searching for all the sstables that contain a
given key.
After this patch, if there is a table t1 in keyspace k1 that has a key
called aa, then
curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"
will return the list of sstable file names that contain that key.
"
* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
Add the API implementation to get_sstables_by_key
api: column_family.json make the get_sstables_for_key doc clearer
column_family: Add the get_sstables_by_partition_key method
sstable test: add has_partition_key test
sstable: Add has_partition_key method
keys_test: add a test for nodetool_style string
keys: Add from_nodetool_style_string factory method
The get_sstables_by_partition_key method is used by the API to return a set
of sstable names that hold a given partition key.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a test for the has_partition_key method: it creates an
sstable with a partition key and then uses that key in the
has_partition_key method to verify that it is there.
It also creates a different key and uses that to verify that a
non-existent key returns false.
This reverts part of commit 364c2551c8. I mistakenly
changed the scylla-ami submodule in addition to applying the patch. The revert
keeps the intended part of the patch and undoes the scylla-ami change.
In 4b1034b (storage_service: Remove the stream_hints), we removed the
only user of the api with the column_families parameter.
std::vector column_families = { db::system_keyspace::HINTS };
streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint),
column_families);
We can simplify the range_streamer code a bit by removing it.
Fixes#3476
Tests: dtest update_cluster_layout_tests.py
Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>
It is useful for the client driver to know which shard is serving a
particular connection, so it can only send requests through that connection
which will be served by the same shard, eliminating a hop.
Support that by advertising a "SCYLLA_SHARD" option, with a value
corresponding to the shard number.
Acked-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180606203437.1198-1-avi@scylladb.com>
* seastar 12cffef...e7275e4 (9):
> tests: execution_stage_test: capture sg by value
> Merge "Add in-path parameter suport to the code generation" from Amnon
> Merge "Add scheduling_group inheritance to execution_stage" from Avi
> tutorial: explain how to find origin of exception
> tls: Ensure handshake always drains output before return/throw
> build: cmake: correct stdc++fs library name once more
> perftune.py: make sure config file existing before write
> Update travis-ci integration
> build: fix compilation issues on cmake. missing stdc++-fs
"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.
Fixes#3483
"
* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
tests/virtual_reader_test: Add test for built indexes virtual reader
db/system_keyspace: Add virtual reader for IndexInfo table
db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
index/secondary_index_manager: Expose index_table_name()
db/legacy_schema_migrator: Don't migrate indexes
If the reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If, in addition, iterators are constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.
See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an example scenario.
Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
There is no reason to use an std::set for it since we don't care about
the ordering - only about the existence of a particular entry.
A hash table will be more efficient for this use case.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528220892-5784-2-git-send-email-vladz@scylladb.com>
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is the first clustering column and contains the token value,
computed on updates.
This series also updates tests and comments referring to issue 3423.
Tests: unit (release, debug)
"
* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
cql3: update token order comments
index, tests: add token column to secondary index schema
view: add handling of a token column for secondary indexes
view: add is_index method
ec2_snitch::gossiper_starting() calls the base class (default) method,
which sets _gossip_started to TRUE and thereby prevents the following
reconnectable_snitch_helper registration.
Fixes#3454
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.
However, most of the time there will be one stream plan at a time, the
next stream plan won't start until the previous one is finished. So, the
current coalescing does not really work.
The delayed flush adds 2s of dealy for each stream session. If we have lots
of table to stream, we will waste a lot of time.
We stream a keyspace in around 10 stream plans, i.e., 10% of ranges a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 27 hours.
To stream a keyspace with 4 tables, each table has 1000 rows.
Before:
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds
After:
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds
Fixes#3436
Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
An additional token column is now present in every view schema
that backs a secondary index. This column is always the first part
of the clustering key, so it forces token order on queries.
The column's name is ideally idx_token, but it can be suffixed
with a number to ensure its uniqueness.
It also updates tests to make them acknowledge the new token order.
Fixes#3423
In order to ensure token order on secondary index queries, the
first clustering column of each view that backs a secondary index
is going to store a token computed from the base table's partition keys.
After this commit, if there exists a column that is not present
in the base schema, it will be filled with the computed token.
After 70c72773be it's possible that
open_version() is called with a phase which is smaller than the phase
of the latest version, because the latest version belongs to the
in-progress cache update. In such a case we must return the existing
non-latest snapshot and not create a new version on top of the
in-progress update. Not doing this violates several invariants, and
may lead to inconsistencies, including violation of write atomicity or
temporary loss of writes.
partition_entry::read() was already adjusted by the aforementioned
commit. Do a similar adjustment for open_version().
Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>
We mistakenly only added network-online.target, which doesn't promise to
wait for the /var/lib/scylla mount.
For that we need local-fs.target.
Fixes#3441
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521083349.8970-1-syuu@scylladb.com>
"
It turns out that compression just works for SSTables 3.x,
thanks to the previous work done on the write path.
This series cleans up tests a bit and introduces test for compression
on the read path.
"
* 'haaawk/sstables3/read-compression-v1' of ssh://github.com/scylladb/seastar-dev:
Add test for compression in sstables 3.x
Extract test_partition_key_with_values_of_different_types_read
sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
Drop UNCOMPRESSD_ when code will be used for compressed too
"
This patch adds nr_shards, msb_ignore, and the actual sharding algorithm to the
system.local table. Drivers and other tools can then make use of this
information to talk to Scylla in an optimal way.
"
* 'system_tables-v3' of github.com:glommer/scylla:
system_keyspace: add sharding information to local table
partitioner: export the name of the algorithm used to do intra-node sharding
We would like the clients to be able to route work directly to the right
shards. To do that, they need to know the sharding algorithm and its
parameters.
The algorithm can be copied into the client, but the parameters need to
be exported somewhere. Let's use the local table for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
v2: force msb to zero on non-murmur
We will export this on system tables. To avoid hard-coding it in the system
table level, keep it at least in the dht layer where it belongs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
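A hedged sketch of the kind of computation a client can reproduce from the exported parameters (nr_shards and msb_ignore); it follows the commonly described Murmur3 scheme of unbiasing the signed token, discarding the ignored high bits, and scaling into the shard count - treat the details as an assumption rather than the authoritative algorithm.
    #include <cstdint>
    // Map a signed 64-bit token to a shard given the exported parameters.
    // Uses gcc/clang's __uint128_t for the widening multiply.
    unsigned shard_of(int64_t token, unsigned nr_shards, unsigned msb_ignore) {
        uint64_t biased = uint64_t(token) + (uint64_t(1) << 63); // make unsigned
        biased <<= msb_ignore;                                   // drop ignored msb
        return unsigned((__uint128_t(biased) * nr_shards) >> 64);
    }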
Currently, build_deb.sh looks very complicated because each distribution
requires different parameters, and we apply them with sed commands one by one.
This patch replaces them with Mustache, a template language with a simple
and easy syntax.
Both .rpm distributions and .deb distributions have pystache (a Python
implementation of Mustache), so we will use it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180604104026.22765-1-syuu@scylladb.com>
"
This series introduces a separate hinted handoff manager for materialized views.
Steps:
* decouple resource limits from hinted handoff, so multiple instances can share space
and throughput limits in order to avoid internal fragmentation for every instance's
reservations
* add a subdirectory to data/, responsible for storing materialized view hints
* decouple registering global metrics from hinted handoff constructor, now that there
can be more than one instance - otherwise 'registering metrics twice' errors are going to occur
* add a hints_for_views_manager to storage proxy and route failed view updates to use it
instead of the original hints_manager
* restore previous semantics for enabling/disabling hinted handoff - regular hinted handoff
can be disabled or enabled just for specific datacenters without influencing materialized
views flow
"
* 'separate_hh_for_mv_4' of https://github.com/psarna/scylla:
storage_proxy: restore optional hinted handoff
storage_proxy: add hints manager for views
hints: decouple hints manager metrics from constructor
db, config: add view_pending_updates directory
hints: move space_watchdog to resource manager
hints: move send limiter to resource manager
hints: move constants to resource_manager
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.
Fixes#3483
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the same comment that exists in Apache Cassandra,
explaining that the table_name column in the IndexInfo system table
actually refers to the keyspace name. Don't be fooled.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Expose secondary_index::index_table_name() so knowledge on how to
built an index name can remain centralized.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Because authorized_prepared_statements_cache caches information that comes from
the permissions cache and from the prepared statements cache, it should have its
entry expiration period set to the minimum of the expiration periods of these caches.
The same goes for the entry refresh period, but since the prepared statements cache
does not have a refresh period, authorized_prepared_statements_cache's entry refresh
period is simply equal to that of the permissions cache.
Fixes#3473
Tests: dtest{release} auth_test.py
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1527789716-6206-1-git-send-email-vladz@scylladb.com>
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
Hints for materialized view updates need to be kept somewhere,
because their dedicated hints manager has to have a root directory.
view_pending_updates directory resides in /data and is used
for that purpose.
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
"
In preparation, we change LCS so that it tries harder to push data
to the last level, where the backlog is supposed to be zero.
The backlog is defined as:
backlog_of_stcs_in_l0 + Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out
where:
* the fan_out is the number of SSTables we usually compact with the
next level (usually 10).
* max_levels is the number of levels currently populated
* sizeof(L) is the total amount of data in a particular level.
Tests: unit (release)
"
* 'lcs-backlog-v2' of github.com:glommer/scylla:
LCS: implement backlog tracker for compaction controller
LCS: don't construct property in the body of constructor
LCS: try harder to move SSTables to highest levels.
leveled manifest: turn 10 into a constant
backlog: add level to write progress monitor
This is the last missing tracker among the major strategies. After
this, only DTCS is left.
To calculate the backlog, we will define the point of zero-backlog
as having all data in the last level. The backlog is then:
Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out,
where:
* the fan_out is the number of SSTables we usually compact with the
next level (usually 10).
* max_levels is the number of levels currently populated
* sizeof(L) is the total amount of data in a particular level.
Care is taken for the backlog not to jump when a new level has been just
recently created.
Aside from that, SSTables that accumulate in L0 can be subject to STCS.
We will then add a STCS backlog in those SSTables to represent that.
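A minimal standalone sketch of this definition (names assumed, not the
actual tracker code; levels are numbered 1..max_levels so the last level
contributes zero, and L0 is covered by the STCS term):

    #include <cstdint>
    #include <vector>

    double lcs_backlog(const std::vector<uint64_t>& level_sizes, // bytes in L1..Lmax
                       double stcs_backlog_in_l0,
                       unsigned fan_out = 10) {
        const unsigned max_levels = level_sizes.size();
        double backlog = stcs_backlog_in_l0;
        for (unsigned l = 1; l <= max_levels; ++l) {
            // The further a level is from the last one, the more work remains.
            backlog += double(level_sizes[l - 1]) * (max_levels - l) * fan_out;
        }
        return backlog;
    }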
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now we are constructing the _max_sstable_size_in_mb property in
the body of the constructor, which makes it hard for us to use from
other properties.
We are doing that because we'd like to test for bounds of that value. So
a cleaner way is to have a helper function for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Our current implementation of LCS can end up with situations in which
just a bit of data is in the highest levels, with the majority in the
lowest levels. That happens because we will only promote things to
highest levels if the amount of data in the current level is higher than
the maximum.
This is a pre-existing problem in itself, but became even clearer when
we started trying to define what is the backlog for LCS.
We have discussed ways to fix this by redefining the criteria on when
to move data to the next levels. That would require us to change the way
things are today considerably, allowing parallel compactions, etc. There
is significant risk that we'll increase write amplification and we would
need to carefully validate that.
For now I will propose a simpler change that essentially solves the
"inverted pyramid" problem of current LCS without major disruption:
keep selecting compaction candidates with the same criteria that we use
today, which helps make sure we are not compacting high levels for no
reason; but if there is nothing to do, use the idle time to push data to
higher levels. As an added benefit, old data that is in the higher levels
can also be compacted away faster.
With this patch we see that in an idle, post-load system all data is
eventually pushed to the last level. Systems under constant writes keep
behaving the same way they did before.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We increase levels in powers of 10 but that is a parameter
of the algorithm. At least make it into a constant so that we can
reuse it somewhere else.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
SSTables 3.x format ('m') stores the size of the previous row or RT marker
inside each row/marker. That potentially allows traversing rows/markers
in reverse order.
The previous code calculating those sizes appeared to produce invalid
values for all rows except the first one. The problem with detecting
this bug was that neither Cassandra itself nor the sstabledump tool use
those values; they are simply skipped on reading.
From the UnfilteredSerializer.deserializeRowBody() method,
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L562 :

if (header.isForSSTable())
{
    in.readUnsignedVInt(); // Skip row size
    in.readUnsignedVInt(); // previous unfiltered size
}
So while the previous test files were technically correct in that they
contained valid data readable by Cassandra/sstabledump, they didn't
follow the format specification.
This patchset fixes the code to produce correct values and replaces
incorrect data files with correct ones. The newly generated data files
have been validated to be identical to files generated with Cassandra
using the same data and timestamps as the unit tests.
Tests: Unit {release}
"
* 'projects/sstables-30/fix-prev-row_size/v1' of https://github.com/argenet/scylla:
tests: Fix test files to use correct previous row sizes.
sstables: Fix calculation of previous row size for SSTables 3.x
sstables: Factor out code building promoted index blocks into separate helpers.
"
This patchset contains two fixes to the clustering key prefixes
serialization logic for SSTables 3.x.
First, it fixes a vexing typo: a bitwise-and (&) has been used instead
of a remainder operator (%) for truncating the shift value.
This did not show up in existing tests because they all had non-empty
clustering columns values.
Added tests to cover empty clustering columns values.
Second, it fixes the logic of serialization to write values up to the
prefix length, not the length of the clustering key as defined by
schema. This matches the way it is done by the Origin.
There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
"
* 'projects/sstables-30/fix-clustering-blocks/v1' of https://github.com/argenet/scylla:
tests: Add test covering compact table with non-full clustering key.
sstables: Improve clustering blocks writing, use logical clustering prefix size.
tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
tests: Add unit test covering empty values in clustering key.
sstables: Fix typo in clustering blocks write helper.
"
Add handling for missing columns and tests for it.
There are 3 cases:
1. Number of columns in a table is smaller than 64
2. Number of columns in a table is greater than 64
2a. and less than half of all possible columns are present in sstable
2b. and at least half of all possible columns are present in sstable
Case 1 is implemented using a bit mask; a column is present if mask & (1 << <column number>) == 0.
Case 2a is implemented by storing the list of column numbers of the present columns.
Case 2b is implemented by storing the list of column numbers of the absent columns.
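A minimal sketch of the case 1 test (names assumed; per the description
above, a set bit in the mask marks an absent column):

    #include <cstdint>

    bool is_column_present(uint64_t missing_columns_mask, unsigned column_number) {
        return (missing_columns_mask & (uint64_t(1) << column_number)) == 0;
    }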
"
* 'haaawk/sstables3/read-missing-columns-v3' of ssh://github.com/scylladb/seastar-dev:
sstables 3: add test for reading big dense subset of columns
sstables 3: support reading big dense subsets of columns
sstables 3: add test for reading big sparse subset of columns
sstables 3: support reading big sparse subsets of columns
sstables 3: add test for reading small subset of columns
sstables 3: support reading small subsets of columns
Debug mode view_schema_test sometimes complains that a bool member
doesn't contain in-range values, apparently in the move constructor.
Initialize them for its benefit to avoid false-positive test
failures.
Message-Id: <20180602184934.31258-1-avi@scylladb.com>
untyped_result_set_row's cell data type is bytes_opt, and the
get_blob() accessor accesses the value assuming it's engaged
(relying on the caller to call has()).
has_unsalted_hash() calls get_blob() without calling has() beforehand,
potentially triggering undefined behavior.
Fix by using get_or() instead, which also simplifies the caller.
I observed failures in Jenkins in this area. It's hard to be sure
this is the root cause, since the failures triggered an internal
consistency assertion in asan rather than an asan report. However,
the error is hard to reproduce and the fix makes sense even if it
doesn't prevent the error.
See #3480 for the asan error.
Fixes#3480 (hopefully).
Message-Id: <20180602181919.29204-1-avi@scylladb.com>
A small subset contains no more than 63 elements.
Support for large subsets will come in the following
patches.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the Origin, the size of the clustering key prefix used during
serialization is the actual length of the prefix and not the full size
as defined in schema. So the code is fixed to align with that logic.
This, in particular, is needed to write clustering blocks for RT
markers.
There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
What was supposed to be a remainder operation turned out to be a
bitwise 'and'. This didn't show up in existing tests only because they
all had non-empty clustering values.
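To illustrate the difference (the variable names and the truncation bound
here are assumptions, not the actual code):

    unsigned shift = pos % bound;  // intended: truncate with a remainder
    unsigned bad   = pos & bound;  // the typo: tests a single bit of pos,
                                   // wrong for almost every value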
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
This is the first part of the first step of switching Scylla. It covers
converting cells to the new serialisation format. The actual structure
of the cells doesn't differ much from the original one with a notable
exception of the fact that large values are now fragmented and
linearisation needs to be explicit. Counters and collections still
partially rely on their old, custom serialisation code and their
handling is not optimal (although not significantly worse than it used
to be).
The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming, with the hope of making it much easier to
modify the serialisation format than it would be in the case of open-coded
serialisation functions.
Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors, hence a different name). This makes it (relatively)
easy to ensure that there is an upper bound on the size of all allocations.
For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.
The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to use the new IMR-based cell implementation.
The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".
Refs #2031.
Refs #2409.
perf_simple_query -c4, medians of 30 results:
       ./perf_base   ./perf_imr    diff
read   308790.08     309775.35     0.3%
write  402127.32     417729.18     3.9%
The same with 1 byte values:
       ./perf_base1  ./perf_imr1   diff
read   314107.26     314648.96     0.2%
write  463801.40     433255.96    -6.6%
The memory footprint is reduced, but that is partially due to removal of
the small buffer optimisation (whether it will be restored depends on the
exact measurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.
Memory footprint:
Before:
mutation footprint:
- in cache: 1264
- in memtable: 986
After:
mutation footprint:
- in cache: 1104
- in memtable: 866
Tests: unit (release, debug)
"
* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
tests/mutation: add test for changing column type
atomic_cell: switch to new IMR-based cell representation
atomic_cell: explicitly state when atomic_cell is a collection member
treewide: require type for creating collection_mutation_view
treewide: require type for comparing cells
atomic_cell: introduce fragmented buffer value interface
treewide: require type to compute cell memory usage
treewide: require type to copy atomic_cell
treewide: require type info for copying atomic_cell_or_collection
treewide: require type for creating atomic_cell
atomic_cell: require column_definition for creating atomic_cell views
tests: test imr representation of cells
types: provide information for IMR
data: introduce cell
data: introduce type_info
imr/utils: add imr object holder
imr: introduce concepts
imr: add helper for allocating objects
imr: allow creating lsa migrators for IMR objects
imr: introduce placeholders
...
Scylla now exposes the Prometheus API by default. This patch changes
scyllatop to use the Prometheus API; the collectd API is still available.
The main changes in the patch:
* Move collectd specific logic inside collectd.
* Add support for help information.
* Add command line options to configure the Prometheus endpoint and to
enable collectd.
* Add a prometheus class that collects information from Prometheus.
Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.
Fixes#3471
Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
With the introduction of the new in-memory representation changing
column type has become a more complex operation since it needs to handle
switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes the information
when an atomic_cell is created as a member of a collection so that later
we can avoid fragmenting the value in such cases.
As preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored: they can be either a part of another IMR
object (once the rows are converted to the IMR) or separate objects
(just like the current atomic_cell).
A view schema's view_info contains the id of the base regular column
that view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.
Fixes#3443
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
IMR objects may own memory. object_allocator takes care of allocating
memory for all owned objects during the serialisation of their owner.
In practice a writer of the parent object would accept a helper object
created by object_allocator. That helper object would either compute
the size of the buffers that have to be allocated or perform the actual
serialisation, in the same two-phase manner as is done for the parent
IMR object.
In some cases the actual value of an IMR object is not known at
serialisation time. If the type is fixed-size we can use a placeholder
to defer writing it to a more convenient moment.
This patch introduces destructors and movers for IMR objects, which
enables them to own memory. Custom destructors and movers can be
defined by specialising the appropriate classes.
This patch adds a new way of serialising bytes and sstring objects in the
IDL. Using write_fragmented_<field-name>() the caller can pass a range
of fragments that would be serialised without linearising the buffer.
Since sstabledump and Cassandra do not use row size values, the new
files have been validated to be identical to files generated by
Cassandra with the same data inserted at the same timestamps.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The previous code incorrectly calculated sizes of previous rows while
writing SSTables in 3.x ('m') format.
The problem with detecting this issue was that neither sstabledump nor
Cassandra 3.x itself use those values, as of today, they are simply
ignored when data is read from files.
Still, we want to be compatible and write correct values as they may be
of use in the future.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
tests/view_complex_test.cc contained a #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.
So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
"
Add handling for static rows and tests for it.
"
* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
sstable_3_x_test: Add test_uncompressed_compound_static_row_read
sstable_3_x_test: add test_uncompressed_static_row_read
flat_mutation_reader_assertions: improve static row assertions
data_consume_rows_context_m: Implement support for static rows
mp_row_consumer_m: Implement support for static rows
mp_row_consumer_m: Extract fill_cells
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:
(1) dropping partition entries from cache or memtables does not defer
(2) dropping partition versions abandoned by detached snapshots does not defer
(3) merging of partition versions when snapshots go away does not defer
(4) cache update from memtable processes partition entries without deferring (#2578)
(5) partition entries are upgraded to new schema atomically
This series fixes problems (1), (2) and (4), but not (3) and (5).
(1) and (2) are fixed by introducing mutation_cleaner objects which are
containers for garbage partition versions which are delaying actual freeing.
Freeing happens from memory reclaimers and is incremental.
(3) and (5) are not solved yet.
(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of a partition. In order to preserve update
atomicity on the partition level as perceived by reads, when an update starts we
create a snapshot of the current version of the partition and process the memtable
entry by inserting data into a separate partition version. This way, if the upgrade
defers in the middle of a partition, reads can still go to the old version and
not see partial writes.
will use the previous phase until whole partition is upgraded. When partition
is finally merged, the snapshots go away and the new version will eventually
be merged to the old version. Due to (3) however, this merging may still add
latency to the upgrade path.
Remaining work:
- Solving problem (3). I think the approach to take here would be to
move the task of merging versions to the background, maybe into mutation_cleaner.
- Merging range tombstones incrementally.
Performance
===========
Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.
For a large partition with lots of small rows we see a significant reduction of
scheduling latency, from ~550ms to ~23ms. The cause of the remaining latency is
problem (3) stated above. The run time is reduced by 70%.
For small partition case without clustering columns we see no degradation.
For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.
For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.
Below you can see full statistics for cache update run time:
=== Small partitions, no overwrites:
Before:
avg = 433.965155
stdev = 35.958024
min = 340.093201
max = 468.564514
After:
avg = 436.929447 (+1%)
stdev = 37.130237
min = 349.410339
max = 489.953400
=== Small partition with a few rows:
Before:
avg = 315.379316
stdev = 30.059120
min = 240.340561
max = 342.408295
After:
avg = 407.232691 (+30%)
stdev = 53.918717
min = 269.514648
max = 444.846649
=== Large partition, lots of small rows:
Before:
avg = 412.870689
stdev = 227.411317
min = 286.990631
max = 1263.417847
After:
avg = 124.351705 (-70%)
stdev = 4.705762
min = 110.063255
max = 129.643387
=== Large partition, lots of range tombstones:
Before:
avg = 601.172644
stdev = 121.376866
min = 223.502136
max = 874.111572
After:
avg = 695.627588 (+15%)
stdev = 135.057004
min = 337.173950
max = 784.838745
"
* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
mvcc: Use small_vector<> in partition_snapshot_row_cursor
utils: Extract small_vector.hh
mvcc: Erase rows gradually in apply_to_incomplete()
mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
cache: real_dirty_memory_accounter: Move unpinning out of the hot path
mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
mutation_partition: Reduce row lookups in apply_monotonically()
cache: Release dirty memory with row granularity
cache: Defer during partition merging
mvcc: partition_snapshot_row_cursor: Introduce consume_row()
mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
mvcc: Make apply_to_incomplete() work with attached versions
cache: Propagate phase to apply_to_incomplete()
cache: Prepare for incremental apply_to_incomplete()
Introduce a coroutine wrapper
tests: mvcc: Encapsulate memory management details
tests: cache: Take into account that update() may defer
cache: real_dirty_memory_accounter: Allow construction without memtable
cache: Extract real_dirty_memory_accounter
mvcc: Destroy memtable partition versions gently
memtable: Destroy partitions incrementally from clear_gently()
mvcc: Remove rows from tracker gently
cache: Destroy partition versions incrementally
Introduce mutation_cleaner
mvcc: Introduce partition_version_list
mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
database: Add API for incremental clearing of partition entries
cache: Define trivial methods inline
tests: Improve perf_row_cache_update
mutation_reader: Make empty mutation source advertize no partitions
Leverage the fact that it is called with monotonically increasing
positions, and avoid lookups in case the current target entry is the
successor of desired position. Reduces cache update latency by 40%
for large partition in a time-series workload.
This change speeds up merging of partition versions with many rows in
case the merged version has many rows which fall between existing rows
in the target version. This is often the case for time-series
workloads, which insert rows at the front. Lookup can be avoided for
all but the first row in the stride because we already have a
reference to the successor in the target tree, we only need to check
that the current entry in the target tree is still the successor.
This change greatly reduces amount of lookups per row during version
merging of large partitions in time-series workloads.
Incremental merging will be implemented by means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
Represents a deferring operation which defers cooperatively with the caller.
The operation is started and resumed by calling run(), which returns
with stop_iteration::no whenever the operation defers and is not
completed yet. When the operation is finally complete, run() returns
with stop_iteration::yes.
This allows the caller to:
1) execute some post-defer and pre-resume actions atomically
2) have control over when the operation is resumed and in which context,
in particular the caller can cancel the operation at deferring points.
It will be used to implement deferring partition_version::apply_to_incomplete().
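A minimal usage sketch (the factory and the loop body here are
assumptions, not actual code):

    // `op` is the deferring operation wrapper described above.
    auto op = make_apply_to_incomplete_op();  // hypothetical factory
    while (op.run() == stop_iteration::no) {
        // The operation deferred. The caller can execute post-defer /
        // pre-resume actions here, choose the context in which to resume,
        // or drop `op` to cancel at this preemption point.
    }
    // run() returned stop_iteration::yes: the operation completed.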
Currently tests have a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself because this requires the region to be unlocked at
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky and would add a lot of noise to tests. This patch will
make things easier by abstracting LSA management, among other things,
inside mvcc_container and mvcc_partition classes.
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.
Large deletions could happen when a large partition gets invalidated,
upgraded to a new schema, or when it's abandoned by a detached snapshot.
Refs #3289.
Partitions can get very large. Destroying them all at once can stall
the reactor for a significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:
stop_iteration clear_gently() noexcept;
It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:
return repeat([this] { return clear_gently(); });
The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
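For example, a container can implement the pattern like this (a sketch
with assumed names; seastar::need_preempt() is the usual preemption check):

    stop_iteration partition_list::clear_gently() noexcept {
        while (!_entries.empty()) {
            _entries.pop_back();            // destroy one entry at a time
            if (seastar::need_preempt()) {
                return stop_iteration::no;  // not done yet; caller may defer
            }
        }
        return stop_iteration::yes;         // fully cleared
    }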
"
This series provides reasoning and clarification for the current
structure of mutate_MV(), and how we handle some scenarios related to
range movements.
"
* 'materialized-views/clarifications/v3' of github.com:duarten/scylla:
db/view: Remove ifdef'd Java code
db/view: Ignore scenario where base replica hasn't joined the ring
db/view: Handle case when base has no paired view replica
"
Add handling for clustering columns and tests for it.
"
* 'haaawk/sstables3/read-ck-v3' of ssh://github.com/scylladb/seastar-dev:
Add test_uncompressed_compound_ck_read for SSTables 3.x
Add test_uncompressed_simple_read for SSTables 3.x
Implement reading clustering key from SSTables 3.x
column_translation: cache fixed value lengths for ck
data_consume_rows_context_m: use cached fixed column value lenghts
column_translation: store fix lengths of column values
consume_row_start: change type of clustering key
Rename ROW_BODY state to CLUSTERING_ROW
We don't need to parse the type every time.
It's better to cache the fixed lengths of column values
for an sstable.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Clustering key in 3.x format is stored differently
so it's easier to create a vector of temporary buffers
instead of a single block of concatenated bytes.
Each temporary buffer stores a value of a single
clustering column.
This is because the way clustering key is stored on disk
in SSTables 3.x is not the same as the way we store it
internally.
This means that we have to first read a value of every
clustering column into temporary_buffer and only then
we can create clustering key using a vector of those
temporary buffers.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Based on:
8daaf9833a
This patch adds a from_nodetool_style_string factory method to partition_key.
The string format follows the nodetool format, where the columns in the
partition key are separated by ':'.
For example, if a partition key has two columns col1 and col2, to get the
partition key that has col1 = val1 and col2 = val2:
val1:val2
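A hypothetical usage sketch (only from_nodetool_style_string comes from
this patch; the signature and surrounding names are assumed):

    // s is a schema_ptr whose partition key columns are (col1, col2).
    partition_key pk = partition_key::from_nodetool_style_string(s, "val1:val2");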
execute_internal() duplicates several code paths, especially in
the select path, for no good reason. It boils down to timeout and
consistency level selection which can be done based on
client_state::is_internal().
This patchset eliminates the duplication and execute_internal(),
simplifying the code.
* github.com:avikivity/scylla cql-no-execute_internal/v2:
cql: schema_altering_statement: make execute() and execute_internal()
equivalent
cql: select_statement: make execute() and execute_internal()
equivalent
cql: query_processor: don't call cql_statement::execute_internal() any
more
cql: cql_statement: remove execute_internal()
As each test completes, report it. This prevents a long-running
test in the beginning of the list from stalling output.
Message-Id: <20180526173517.23078-1-avi@scylladb.com>
"
This series introduces frozen_mutation_fragment which can be used to
send mutation_fragments over the wire to a remote node. The main
intended user is going to be the new streaming implementation.
The first part of the series fixes some IDL issues related to empty
structures and variant being the first member of a structure. Both these
problems make the generated code fail to build and they do not, in any
way, affect the existing on-wire protocol.
Logic responsible for freezing and unfreezing of mutation_fragments is
heavily based on the existing code for freezing mutations and shares the
same drawbacks (for example, unnecessary copy during unfreezing). These
preexisting performance problems can be fixed incrementally.
Another performance problem (which affects frozen_mutations as well, but
to a lesser extent) is that since the batching is done at a different
layer each frozen mutation fragment is a separate bytes_ostream object
owning at least one memory buffer. If the mutation fragments are small
this will cause an excessive number of allocations. This could be solved
either by freezing fragments in batches (though it goes against the RPC
layer doing its own batching) or using bytes_ostream or an equivalent
object with a buffer allocation policy more suitable for such use cases.
This also is something that probably could be an incremental fix.
Tests: unit (release)
"
* tag 'frozen_mutation_fragment/v1-rebased' of https://github.com/pdziepak/scylla:
idl: add idl description of frozen_mutation_fragments
tests: add test for frozen_mutation_fragments
frozen_mutation: introduce frozen_mutation_fragment
tests/idl: test variant being the first member of a structure
idl: create variant state in root node
tests/idl: test serialising and deserialising empty structures
idl-compiler: avoid unused variable in empty struct deserialisers
tests/mutation_reader: disambiguate freeze() overload
Apache Cassandra handles a case where the node hasn't joined the ring
and may consequently have an outdated view of it. Following the same
reasoning as with the previous patch, we ignore this scenario. It
happens when there are range movements, and this node is bootstrapping,
but there are already other mechanisms in the cluster, such as hinted
handoff and dual-writing to replicas during range movements, that
contribute to this update eventually making its way to the view.
This patch doesn't change any behavior, but it provides the reasoning
why we won't use the batchlog as Cassandra does, or the hinted handoff
log as we will, to later send the update when the node is joined (note
that Cassandra just sends the mutations "later", and doesn't check
again for any condition or change).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If no view replica is paired with the current base replica, it means
there's a range movement going on (decommission or move), such that
this base replica is gaining new token ranges. The current node is
thus a pending_endpoint from the POV of the coordinator that sent the
request.
Sending view updates to the view replica this base will eventually be
paired with only makes a difference when the base update didn't make
it to the node which is currently being decommissioned or moved-from.
The update will, however, make it to that node if HH is enabled at the
coordinator, before the range movement finishes, or later to this node
when it becomes a natural endpoint for the token.
We still ensure we send to any pending view endpoints though, at least
until we handle that case more optimally.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
All cql_statement::execute_internal() overrides now either throw or
call execute(). Since we shouldn't be calling the throwing overrides
internally, we can safely call execute() instead. This allows us to
get rid of execute_internal().
execute_internal(), for some code paths, differs from execute by the
following:
1. it uses CL_ONE unconditionally
2. it has no query timeout
3. it doesn't use execution stages
for other code paths, it just calls execute.
As preparation for getting rid of execute_internal(), unify the two
code paths.
Commit 4859b759b9 caused the consistency level and timeouts
to be provided by the caller, so using the caller provided parameters
instead of overriding them does not change behavior.
"
This patchset makes all users of query_processor specify their timeouts
explicitly, in preparation for the removal of
cql_statement::execute_internal() (whose main function was to override
timeouts).
"
* tag 'cql-explicit-timeouts/v1' of https://github.com/avikivity/scylla:
query_processor: require clients to specify timeout configuration
query_processor: un-default consistency level in make_internal_options
"
Firstly, this patchset removes the is_fixed_length() function of
abstract_type in favour of value_length_if_fixed().
Secondly, it fixes the byte_type to be compatible with Cassandra, which
erroneously treats it as a variable-length data type.
Lastly, it adds a unit test covering all non-composite CQL data types
for writing.
Tests: unit {release}
"
* 'projects/sstables-30/different-data-types/v1' of https://github.com/argenet/scylla:
tests: Add a unit test for writing different data types to SSTables 3.x format.
types: Treat byte_type as a variable-length type for compatibility reasons.
types: Remove is_value_fixed() and use value_length_if_fixed() instead.
Although values of the byte_type, which corresponds to the CQL TINYINT
type, always occupy only a single byte, Cassandra treats it as a
variable-length type for SSTables 3.0 reading and writing.
While it is clearly a mistake at Cassandra side, we have to stay
compatible.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This patch introduces IDL definition as well as serialisers and
deserialisers for freezing mutation_fragment so that they can be
transferred between nodes in a cluster.
Each non-final IDL object is preceded by a frame containing its size.
In case of boost::variant there is a frame for the variant itself, an
integer determining the active alternative of the variant and a frame of
that active alternative.
However, if a variant was the first member of a writable stub object the
IDL would generate code that would not write the frame for the variant.
This is not a very severe issue since there are no such cases right now,
as the C++ type system would not allow such generated code to compile.
Deserialisers generated by IDL compiler first create a substream
covering the deserialised structure and then skip and read appropriate
members. If there are no members the substream will be unused and prompt
the compiler to emit a warning.
"
This patch series fixes#3405: secondary-index search only provided
correct results in certain cases, where entire partitions or contiguous
partition slices matched the query. When this was not the case, and
individual clustering rows match or do not match the query, the wrong
results were returned.
To fix this bug, we need to fix the two stages of secondary-index search:
1. In the first stage, we read from the index MV a list of row keys
(i.e., primary keys) matching the query. We can no longer remember
just the partition keys, and need to keep the list of full primary keys.
2. In the second stage, we have a list of rows (not partitions) and need
to read their selected contents to return to the user. Since CQL queries
do not have a syntax to select an arbitrary list of rows, we have to
add new code to do such a selection.
Because we provide an ad-hoc, inefficient, implementation for the row
selection described in stage 2, these patches leave two paths in the code:
The old path, efficiently selecting entire partitions, and the new path,
selecting individual rows. The old path is still used when it is applicable,
which is when a partition key column or the first clustering key column
is searched.
"
* 'si-fix-v4' of http://github.com/nyh/scylla:
secondary index: test multiple clustering column
secondary index: fix wrong results returned in certain cases
secondary index: method for fetching list of rows from base table
secondary index: method for fetching list of rows from index
select_statement.cc: refactor find_index_partition_ranges()
select_statement.cc: fix variable lifetime errors
This patch adds a test for secondary indexes on a table which has many
columns - two partition key column, two clustering key columns, and two
regular columns. We add a bunch of data in various rows and partitions,
index all columns and search on this data and verify the results.
This test exposed various bugs in secondary index search, including
issue #3405. After we fixed those bugs, the test now passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The current secondary-index search code, in
indexed_table_select_statement::do_execute(), begins by fetching a list
of partitions, and then the content of these partitions from the base
table. However, in some cases, when the table has clustering columns and
we are not searching on the first one of them, doing this work at partition
granularity is wrong and yields wrong results, as demonstrated in
issue #3405.
So in this patch, we recognize the cases where we need to work in
clustering row granularity, and in those cases use the new functions
introduced in the previous patches - find_index_clustering_rows() and
the execute() variant taking a list of primary-keys of rows.
Fixes#3405.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We add a new variant of select_statement::execute() which allows selecting
an arbitrary list of clustering rows. The existing execute() variant can't
do that - it can only take a list of *partitions*, and read the same
clustering rows from all of them.
The new select variant is not needed for regular CQL queries (which do
not have a syntax allowing reading a list of rows with arbitrary primary
keys), but we will need it for secondary index search, for solving
issue #3405.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have a method find_index_partition_ranges(), to fetch a list
of partition keys from the secondary index. However, as we shall see in
the following patches (and see also issue #3405), getting a list of entire
partitions is not always enough - the secondary index actually holds a list
of primary keys, which includes clustering keys, and in some queries we
can't just ignore them.
So this patch provides a new method find_index_clustering_rows(), to
query the secondary index and get a list of matching clustering keys.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function find_index_partition_ranges() is used in secondary index
searches for fetching a list of matching partitions. In a following patch,
we want to add a similar function for getting a list of *rows*. To avoid
duplicate code, in this patch we split parts of find_index_partition_ranges()
into two new functions:
1. get_index_schema() returns a pointer to the index view's schema.
2. read_posting_list() reads from this view the posting list (i.e., list
of keys) for the current searched value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
do_with() provides code a *reference* to an object which will be kept
alive. It is a mistake to make a copy of this object or of parts of it,
because then the lifetime of this copy will have to be maintained as well.
In particular, it is a mistake to do do_with(..., [] (auto x) { ... }) -
note how "auto x" appears instead of the correct "auto& x". This causes
the object to be copied, and its lifetime not maintained.
This patch fixes several cases where this rule was broken in
select_statement.cc. I could not reproduce actual crashes caused by
these mistakes, but in theory they could have happened.
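For illustration (make_state() and use() are stand-ins, not real code):

    // Wrong: `auto x` copies the object do_with keeps alive; the copy's
    // lifetime is no longer managed.
    do_with(make_state(), [] (auto x) { return use(x); });

    // Right: take a reference to the kept-alive object.
    do_with(make_state(), [] (auto& x) { return use(x); });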
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
This patchset implements reading row columns from SSTable 3 format data file.
Tests: units (release)
"
* 'haaawk/sstables3/read-columns-v4' of ssh://github.com/scylladb/seastar-dev: (21 commits)
Add test for reading column values of different types.
Support all fixed size column types from SSTable 3.x
Add abstract_type::value_length_if_fixed
Add test for simple table with value
flat_reader_assertions: Add produces_row taking column values
Implement reading rows and columns in data_consume_rows_context_m
Introduce column_flags_m
Add column_translation to data_consume_rows_context_m
Pass schema to data_consume_context
Add column_translation.hh
consumer_m: Add consume methods for consuming rows and columns
Extract make_atomic_cell from mp_row_consumer_k_l
Rename NON_STATIC_ROW_* states to ROW_BODY_*
Add liveness_info and use it in reading sstables
Add helper methods for parsing simple types.
Add unfiltered_flags_m::has_all_columns
data_consume_context: use make_unique instead of new
Pass serialization_header to data_consume_rows_context*
Use disk_string_vint_size for bytes_array_vint_size
Introduce disk_string_vint_size type
...
We need to specify --configfile on pdebuild too, otherwise we will
always fail to build .deb on a newly created build environment.
The only reason we are still able to build .deb is that we already copied
.pbuilderrc to the home directory on existing build environments.
Fixes#3456
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523204112.24669-1-syuu@scylladb.com>
It will be needed to obtain column_translation that will
be added to data_consume_context in the next patch.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The new name describes the states better, as those states
will be used both for static and non-static rows.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This series introduces a cache of already authenticated prepared statements which
is meant to optimize the prepared statement lookup when authentication is enabled.
This cache allows performing a single cache lookup per EXECUTE operation as opposed
to at least 2 lookups: one in the prepared statements cache and one in the authentication
cache.
Tests:
- cql_query_test {debug, release}.
- cassandra-stress with authentication enabled and with short eviction timeout.
- Manual (with printouts) checks:
- Tested the eviction due to eviction in the prepared_statements_cache:
- Artificially decreased the prepared_statements_cache size and ran c-s with different keyspaces.
- Verified that the corresponding authorized_prepared_statements_cache entry is evicted and re-populated.
- Tested the BATCH of prepared statements (with dtest infrastructure):
- Verified that for each prepared statement authorized_prepared_statements_cache is updated only once:
- The batch contained a few entries of the same prepared statement.
"
* 'authorized_prepared_statements_cache-v3' of https://github.com/vladzcloudius/scylla:
cql3: use authorized_prepared_statements_cache in the BATCH processing
cql3::statements::batch_statement: introduce a single_statement class
cql3: introduce the authorized_prepared_statements_cache class
loading_shared_values: introduce the templated find() overload
tests: loading_cache_test: add a tests for a loading_cache::remove(key)/remove(iterator)
utils::loading_cache: add remove(key)/remove(iterator) methods
cql3::query_processor: properly stop() prepared_statements_cache object
Since Seastar no longer (1f005fb434) requires libunwind, we can
drop it from our dependency list. This helps the power build, for
which no libunwind is available.
Fixes#3453.
Message-Id: <20180523114750.10753-1-avi@scylladb.com>
"
This series implements the backlog tracker for TWCS, allowing it to
be controlled. The backlog for a TWCS column family is just the sum of
the SizeTiered backlogs for all the windows that we know about.
A possible optimization for this is to stop tracking windows after
they become old enough and revert to zero backlog. I reverted that
last minute, though, since this will probably cause the backlog to
completely misrepresent reality if we import SSTables into old buckets
with things like repairs or nodetool refresh.
"
* 'twcs-backlog-v4.1' of github.com:glommer/scylla:
backlog: implement backlog tracker for the TWCS
STCS_backlog: allow users to query for the total bytes managed
backlog: keep track of maximum timestamp in write monitor
memtable: also keep track of max timestamp
The TWCS backlog is relatively simple: we just need to keep track of
which SSTable belong to which time window (and actually as usual,
just their sizes). That is an easy thing to do since we can statically
calculate the time bound from the timestamp.
Once we do that we can just sum the backlogs for each individual window.
Time windows that are well enough into the past can be at some point
discarded when their backlogs become zero.
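A minimal sketch of that summation (types and names assumed):

    #include <map>

    // One SizeTiered backlog per time window; the TWCS backlog is their sum.
    double twcs_backlog(const std::map<int64_t, double>& backlog_per_window) {
        double total = 0;
        for (auto& entry : backlog_per_window) {
            total += entry.second;
        }
        return total;
    }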
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The exploded_clustering_prefix type has a convenient is_empty() method
and an even more convenient "operator bool" shortcut. Unfortunately,
the other clustering prefix types (clustering_key_prefix,
clustering_key_prefix_view) have, for historic reasons, an is_empty
method which takes a schema parameter. That also means they can't
have an "operator bool" shortcut.
But checking whether a prefix is empty doesn't really need the schema - all we need to
check is whether the byte representation is empty. The result is simpler
and more efficient code, and easier to use. It is also more consistent -
all clustering-key-related types will have an "operator bool" instead of
just some of them.
To avoid massive code changes, we leave an is_empty(schema) variant, which
simply calls is_empty(). There's already precedent for that - various
methods which have a variant taking schema (and ignoring it) and one
taking nothing.
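A minimal sketch of the resulting interface (member names assumed):

    // Emptiness needs only the byte representation, not the schema.
    bool is_empty() const { return _bytes.empty(); }
    explicit operator bool() const { return !is_empty(); }
    // Kept to avoid massive code changes; the schema is ignored.
    bool is_empty(const schema&) const { return is_empty(); }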
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180521174220.13262-1-nyh@scylladb.com>
Like with the EXECUTE command avoid authorizing the same prepared
statement twice - this time in the context of processing the BATCH
command.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This is a helper class needed to control the handling process of a single
statement in the current batch. In particular it has the boolean defining
if the authorization is needed for this statement.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Add a cache that stores a checked weak pointer to already authorized prepared
statements and whose key is a tuple of an authenticated_user and the key of the
prepared_statements_cache.
The entries will be held as long as the corresponding prepared statement is valid (cached)
and will be discarded with the period equal to the refresh period of the permissions cache.
Entries are also going to be discarded after 60 minutes if not used.
The purpose of this new cache is to save the lookup in the permissions cache for an
already authenticated resource (whatever needs to be authenticated for the particular
prepared statement).
This is meant to improve the cache coherency as well (since we are going to look in a single cache
instead of two).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This overload allows searching the elements by an arbitrary key, as long as it
hashes to the same values as the default key and there is a comparator for
this new key.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
remove(key): removes the entry with the given key if it exists, otherwise does nothing.
remove(iterator): removes an entry by a given iterator (returned from loading_cache::find()).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This commit makes sure that the hints manager is always initialized,
including creating hints directories and starting it.
This is needed because the hints manager is internally used
to store failed materialized view updates.
Fixes#3451
Message-Id: <44532fd3704e20cabeb9c4985dace5650fd22d2c.1527018865.git.sarna@scylladb.com>
"
This series addresses issue #3202 about dropping a table with secondary
indexes present. Previously dropping such tables was impossible due to
materialized view restrictions (which is an implementation detail
of Scylla's secondary indexes).
Implemented:
* fixing 'DROP KEYSPACE' with active materialized views
* adapting schema_builder to make it easy to drop indexes
* dropping all dependent SI before dropping a table
* a test case for dropping a table with secondary indexes
"
* 'drop_si_before_drop_table_3' of https://github.com/psarna/scylla:
tests: add test for dropping a table with secondary indexes
migration_manager: allow dropping table with secondary indexes
schema: add clearing indexes to schema builder
database: do not truncate already removed views
This commit adds a test case for dropping a table with dependent
secondary indexes. Dependent materialized views prohibit the table
from being dropped, but dropping a table with dependent SI is legal.
References #3202
Previously dropping a table with secondary indexes failed, because
SI are internally backed by materialized views.
This commit triggers dropping dependent secondary indexes before
dropping a table.
Fixes#3202
This commit clears the table's views before truncating it
in the drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in no_such_column_family error.
References #3202
"
This series introduces materialized view statistics, as stated in issue #3385:
- updates pushed
- updates failed
- row lock stats
It also addresses issue #3416 by decoupling user write stats from view
update stats.
"
* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
view: adapt view_stats to act as write stats
storage_proxy: decouple write_stats from stats
db: add row locking metrics
view: add view metrics
We would like to know whether there is still backlog at rest in a
particular STCS object. This is useful, for instance, in the TWCS
backlog, that uses STCS so it can delete old windows that are no longer
used.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
For sealed SSTables we can get the maximum timestamp from the statistics
component. But for partially written SSTables, the metadata is not yet
available.
One way to solve this would be to make the SSTable statistics available
earlier. But we would end up with a maximum timestamp that potentially
changes all the time as we write more cells.
A better approach is to take note of what's the maximum timestamp in a
memtable before we start to flush, and when time comes for us to flush
we will use the progress manager to inform the consumers about the
maximum timestamp.
For SSTables being compacted, we can't know for sure what is the maximum
timestamp as some entries could be TTLd already. But the maximum of all
SSTables present in the compaction is a good enough estimation for this
purposes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are now keeping track of the minimum timestamp in a memtable. Also
keep track of the max timestamp so we can know what it is before we
finish flushing the entire memtable to an SSTable. Will be used by
partially written SSTables undergoing TWCS.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This was sent before as two separate patchsets. It is now unified
because it has a lot of common infrastructure.
In this patchset I am aiming at two goals:
1) Provide a minimum amount of shares for user-initiated operations like
nodetool compact and nodetool cleanup
2) Be more robust with exceptions in the backlog tracker
For the first, the main difference is that I now made the compaction
controller a part of the compaction manager. It then becomes easy to
consult with the compaction controller for the correct amount of shares
those operations should have.
In compaction_strategy.cc, the major_compaction_strategy object was
actually already unused before. So instead of making use of it, which
would require some form of information flow downwards about the backlog
we need to export, I am creating a user-initiated backlog type inside
the compaction manager.
With the two changes described above everything is very well
self-contained within the compaction manager and the implementation
becomes trivial.
For the second, I am now handling exceptions in two places:
1) the backlog computation. Those are const functions, so if we just have
a transient exception when computing the backlog, all we need to do is
return some fixed amount of shares and try again in the next adjustment
window.
2) the process of adding / removing SSTables. Those are harder, since if
we fail to manipulate the list we'll be left in an inconsistent state.
The best approach is then to disable the backlog tracker and return a
fixed amount of shares globally.
Tests: unit (release)
"
* 'backlog-improvements-v3' of github.com:glommer/scylla:
compaction_manager: disable backlog tracker if we see an exception
backlog tracker: protect against exceptions in backlog calculation.
STCS_backlog: protect against negative backlog
STCS_backlog: remove unused attribute
compaction strategy: move size tiered backlog to a header
compaction_strategy: delete major_compaction_strategy class
compaction: make sure that user-initiated compactions always have a minimum priority
backlog_controller: add constants to represent a globally disabled controller
backlog_controller: move compaction controller to the compaction manager
backlog_controller: allow users to compute inverse function of shares
This commit adapts the view_stats structure so it can be passed
to storage_proxy as write stats. Thanks to that, mv replica updates
will not interfere with user write metrics. As a side effect it also
provides more stats for replica view updates.
Closes#3385
Closes#3416
This commit extracts metrics related to writes from the stats structure,
so it can be easily replaced later, e.g. for materialized view metrics.
References #3385
References #3416
This commit adds statistics to the row_locker class. Metrics are
independently counted for all lock types: row<->partition and
exclusive<->shared.
Metrics gathered:
- total acquisitions
- operations that wait on the lock
- histogram of the time spent on waiting on this type of lock
References #3385
References #3416
This commit introduces view statistics:
- updates pushed to local/remote replicas
- updates failed to be pushed to local/remote replicas
Metrics are kept on a per-table basis, i.e. updates_pushed_remote
shows the total number of updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if a view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.
References #3385
References #3416
The compactor collects all currently active memtables and later replaces
them with the merged result. The problem is that active memtable
belongs to the input set during compaction and as a result mutations
applied concurrently with compaction could be lost once compaction
replaces the memtables. The fix is to open a new active memtable when
compaction starts.
Caused sporadic failures of row_cache_test.cc:test_continuity_is_populated_when_read_overlaps_with_older_version()
Message-Id: <1526997724-13037-1-git-send-email-tgrabiec@scylladb.com>
If we see an exception when adding or removing SSTables from the backlog
tracker, the backlog tracker can be inconsistent forever. It would be
best if we act before that happens and disable the backlog tracker. Once
the backlog tracker is disabled it will default to returning a fixed
number of shares.
We can either disable the backlog tracker or remove it. But if we remove
it we can end up with a backlog of zero if that's the only tracker with
a backlog. So we keep it registered but mark it as disabled. This also
leaves room for recovery in some situations: we can recover the backlog
by doing a schema change in the column family that had the backlog
disabled, for instance.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Backlog calculations should be exception free, but there are cases in
which I can see them happening. One example is if some backlog tracker
that uses temporary objects fails an allocation.
Memory shortages can be especially pernicious: if we leave the
responsibility of catching those to the individual backlog tracker, we
will keep trying to make more allocations in the other backlog trackers
if we have many column families. By handling it here we can stop that.
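A minimal sketch of that central handling (the function shape and the
fixed_backlog value are hypothetical):

    #include <functional>

    // Poll a tracker's const backlog computation in one central place;
    // any transient exception (e.g. a failed allocation) degrades to a
    // fixed backlog, retried at the next adjustment window, instead of
    // being handled by every individual tracker.
    double backlog_or_fallback(const std::function<double()>& compute,
                               double fixed_backlog) {
        try {
            return compute();
        } catch (...) {
            return fixed_backlog;
        }
    }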
Signed-off-by: Glauber Costa <glauber@scylladb.com>
A negative backlog can be interpreted as a very large backlog.
Part of that is because we keep the total_size as an unsigned type,
which is what we expect. But if there is an issue -- like an
exception that causes some SSTable not to be tracked -- then this size
can effectively become negative. Returning a zero backlog is better than
allowing it to be interpreted as a giant number.
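A minimal sketch of the guard, assuming unsigned sizes as described
(names hypothetical):

    #include <cstdint>

    // total_size is unsigned, so if an untracked SSTable makes the
    // compacted bytes exceed it, a naive subtraction would wrap around
    // into a giant backlog. Clamp to zero instead.
    double remaining_backlog(uint64_t total_size, uint64_t compacted) {
        if (compacted >= total_size) {
            return 0.0; // a zero backlog beats a bogus huge one
        }
        return double(total_size - compacted);
    }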
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This attribute ended up being unused in the final version.
Spotted now while reading the code for other purposes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It's very common for other strategies to include a SizeTiered
step somewhere inside their algorithms: LCS will do SizeTiered on
L0, TWCS will do SizeTiered within a window, etc.
To make it easier for those strategies to consume the SizeTiered
backlog tracker, we will move it to its own file.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It was already unused before this series. In an earlier version I had
used it to provide an ad-hoc backlog for major compactions. But now that
this is done by the compaction manager, this class really isn't being
used.
And it is likely it won't be: major compaction is not a compaction
strategy a user can choose, unlike the others that need to be built
through make_compaction_strategy.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have observed the following behavior with user initiated compactions,
like major compactions:
- if there are no writes, the backlog doesn't increase.
- as compaction progresses the backlog decreases.
- at some point, the backlog is so low that compaction barely makes any
progress.
Going forward, we should allow one to read from the generated partial
SSTables, in which case this doesn't matter that much. But for
user-initiated compactions we would like to guarantee a minimum baseline.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.
We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.
So a mechanism to globally disable the controller is still warranted,
and infinity is a good way to represent that. It's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it as the other way
around from what it means, just meaning "a very large backlog".
Let's turn that into a constant instead. It will help us convey meaning.
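A sketch of the idea, with a hypothetical constant name:

    #include <limits>

    // Give the sentinel a name so readers don't mistake it for merely
    // "a very large backlog".
    constexpr double disabled_backlog = std::numeric_limits<double>::infinity();

    inline bool controller_globally_disabled(double backlog) {
        return backlog == disabled_backlog; // comparing against inf is exact
    }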
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager -- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.
Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.
Once we do that, all the aforementioned problems go away. So let's move
it there, where it belongs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Fixes#3446
Previously, only shutdown-synced objects were actually closed,
which is wrong.
This introduces yet another queue, processed together with the
deletion objects, which ensures we explicitly close all objects
that have been discarded.
Message-Id: <20180521140456.32100-1-calle@scylladb.com>
"This series leverages hinted handoff for failed view replica
updates."
* 'materialized_view_updates_with_hh_5' of https://github.com/psarna/scylla:
storage_proxy: enable hinted handoff for materialized views
storage_proxy: make view updates use consistency_level::ANY
row::find_cell() may be called for cells that do not exist in that row.
In such a case nullptr is returned; this patch makes sure that
it is not dereferenced.
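A generic analogue of the pattern being guarded (std::map stands in for
the real row type):

    #include <cstdio>
    #include <map>

    // Stand-in for row::find_cell(): returns nullptr when the column has
    // no cell in this row; callers must check before dereferencing.
    const int* find_cell(const std::map<int, int>& row, int column_id) {
        auto it = row.find(column_id);
        return it == row.end() ? nullptr : &it->second;
    }

    int main() {
        std::map<int, int> row{{1, 42}};
        if (const int* cell = find_cell(row, 2)) { // absent: nullptr, skipped
            std::printf("%d\n", *cell);
        }
    }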
Message-Id: <20180522091726.24396-1-pdziepak@scylladb.com>
There are some situations in which we want to force a specific amount of
shares and don't have a backlog. We can provide a function to get that
from the controller.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar a6cb005...5da5d4e (6):
> append_challenged_posix_file_impl: Ensure continuation uses non-stale object
> utils: make make_visitor() public
> tcp: Adjust receive window
> tcp: Fix allowed sending size calculation in can_send
> tcp: Fix assert in tcp::tcb::output_one
> be more descriptive with failed syscalls for filesystem operations
Contains alternative fix for #3446 (will also be fixed directly).
This commit initializes and enables hinted handoff for materialized
views, even if HH is not explicitly turned on in config.
User writes still use hinted handoff only if it is explicitly enabled,
while materialized views are allowed to use it unconditionally
in order to store failed replica updates somewhere.
Fixes#3383
This commit makes view replica updates internally use consistency
level ANY, so in case an update fails it will fall back to hinted
handoff.
References #3383
install(1) creates missing directories on recent Fedora, but not
on CentOS 7. This causes the RPM build (which installs to a pristine
tree, without an existing /etc) to fail.
Fix by setting up /etc.
Tests: rpm (Fedora, CentOS)
Message-Id: <20180520124937.20466-1-avi@scylladb.com>
There is no need to call dht::split_ranges_to_shards to split the token
range into <shard> : <a lot of small ranges> mapping and create a flat
mutation reader with a lot of small ranges.
Because:
1) The flat mutation reader on each shard only returns data that belongs
to the local shard, so there is no correctness issue if we do not split
the range and feed only the sub-ranges that belong to this local shard.
2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that given a token range, all the shards will have data for the range
anyway. Even if we ask all the shards to work on the token range and
some of the shards have no data for it, it is fine. We simply send no
data from this shard.
Tests: update_cluster_layout_tests.py
Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
In streaming, the sender sends the mutations on all of its local shards
in parallel, so it is possible that the receiver handles more than one
such connection on the same shard. This is determined by where the tcp
connection goes. The current rpc ignores the destination shard id when
sending the rpc message.
For instance, say node1 has 2 shards, node2 has 2 shards. Currently, we
can end up with like this:
Node 1 shard 0 -> Node 2 shard 1
Node 1 shard 1 -> Node 2 shard 1
It is better if we do:
Node 1 shard 0 -> Node 2 shard 0
Node 1 shard 1 -> Node 2 shard 1
This patch solves this problem by letting the handler always run on
shard = src_cpu_id % smp::count.
If sender and receiver have the same shard config, the work is
distributed completely evenly.
If sender and receiver do not have the same shard config, it is
unavoidable that some of the shards will do more work than others.
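The routing rule itself is a one-liner; a small sketch of how it maps
connections, assuming only the sender's shard id and the receiver's
shard count:

    #include <cstdio>

    // shard = src_cpu_id % smp::count, evaluated on the receiver.
    unsigned destination_shard(unsigned src_cpu_id, unsigned smp_count) {
        return src_cpu_id % smp_count;
    }

    int main() {
        // Sender and receiver both have 2 shards: a 1:1 mapping.
        for (unsigned src = 0; src < 2; ++src) {
            std::printf("Node 1 shard %u -> Node 2 shard %u\n",
                        src, destination_shard(src, 2));
        }
    }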
Tests: dtest update_cluster_layout_tests.py
Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
Fixes#2793
Prints error handle class (commitlog or "other/disk") + exception
type and message. While not exhaustive, at least gives a correlation
point to (hopefully) other log printouts.
Message-Id: <20180509081040.7676-1-calle@scylladb.com>
"
For compression, SSTables 3.x format uses CRC32 for checksumming
compressed chunks as well as for calculating the full file checksum.
Also, while for older formats the "full checksum" of a compressed data file
means a combination of checksums of its compressed chunks, in SSTables
3.x this now reads literally and means the checksum of all bytes
written, including per-chunk digests.
Tests: unit {debug, release}
"
* 'projects/sstables-30/write-compression/v3' of https://github.com/argenet/scylla:
tests: Add unit tests for writing compressed SSTables 3.x.
tests: Validate Digest32.crc for SSTables 3.x write tests.
tests: Fix invalid Digest file for write_counter_table test.
sstables: Support writing compressed SSTables 3.0.
sstables: Make compressed streams customizable on checksumming.
sstables: Move checksum calculation logic to compressed_output_stream.
Previously, compressed_output_stream used to calculate checksum of the
supplied chunk and pass it to the 'compression' object to combine with
the full checksum calculated on prior writes.
Now, all the checksum calculation happens inside
compressed_output_stream and 'compression' only stores the result.
This is done to loosen ties between the two classes and to simplify
compressed_output_stream customisation with various checksum algorithms.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
We currently move the pointer we acquired to the segment into
the lambda in which we'll handle the cycle.
The problem is that we also use that same pointer inside the exception
handler. If an exception happens we'll access the moved-from pointer and crash.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
* tag 'tgrabiec/fixes-and-improvements-for-gdb-scripts-v1' of github.com:tgrabiec/scylla:
gdb: Print live object size from 'scylla lsa-segment'
gdb: Extend 'scylla segment-descs' output with full occupancy info
gdb: Print allocated object's type name instead of full LSA migrator
gdb: Fix LSA migrator discovery
gdb: Drop code related to LSA zones
gdb: Fix uses of removed segment_descriptor::_lsa_managed
lsa: Add use for debug::static_migrators
Move code to a traditional install.sh script (more traditional would be
a "make install", but this is close enough).
This allows testing installation independently of packaging. In addition,
non-Red Hat-packaging can share much of the code in install.sh.
Ref #3243.
Tests: build+install rpm
Message-Id: <20180517114147.30863-1-avi@scylladb.com>
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it causes the module load to fail, with the result that the
storage is not available.
Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.
Fixes#3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".
I measured a 50% reduction of cache update run time in a steady state for an
append-only workload with a large partition, in the perf_row_cache_update version from:
c3f9e6ce1f/tests/perf_row_cache_update.cc
Other workloads and other allocation sites could probably also see the
improvement.
"
* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
lsa: Expose counters for allocation and compaction throughput
lsa: Reduce amount of segment compactions
lsa: Avoid the call to segment_pool::descriptor() in compact()
lsa: Make reclamation on reserve refill more efficient
Fixes#3339
* seastar 840002c...0a1a327 (7):
> Merge "fix perftune.py issues with cpu-masks on big machines" from Vlad
> Merge 'Handle Intel's NICs in a special way' from Vlad
> reactor: fix calculation of idle ticks
> log: streamline logging internals a little
> Merge "CMake imrovements and compatibility" from Jesse
> iotune: fix typo in property name
> cmake: do not find_package(Boost ...) if Boost is a target
"
This patchset adds support for writing counter cells in SSTables 3.x
format ('m'). The logic of writing counters is almost identical to that
used for the old 2.x format ('k'/'l') with the only difference that the
data length preceding serialised shards is written as a vint.
Tests: unit {release}.
Generated SSTables are verified to be processed fine by sstabledump
(note that sstabledump only outputs the binary data for counters, not
their actual values, same as sstable2json).
Verified with Cassandra 3.11 to get the expected values from the
counters table:
cqlsh> SELECT * from sst3.counter_table;
pk | ck | rc1 | rc2
-----+-----+-----+-----
key | ck1 | 10 | 1
(1 rows)
Verified that the deleted counter can no longer be updated:
cqlsh> use sst3 ;
cqlsh:sst3> UPDATE counter_table SET rc1 = rc1 + 2 WHERE pk = 'key' AND ck = 'ck2';
cqlsh:sst3> SELECT * from sst3.counter_table;
pk | ck | rc1 | rc2
-----+-----+-----+-----
key | ck1 | 10 | 1
(1 rows)
"
* 'projects/sstables-30/write_counters/v1' of https://github.com/argenet/scylla:
tests: Unit tests to cover writing counters in SSTables 3.x format.
sstables: Support writing counters for SSTables 3.x.
sstables: Move code writing counter value into a separate helper.
The sstable test fails when running concurrently (for example, in release
and debug mode) because it uses a static temporary dir in lots of tests.
Let's fix it by switching to a dynamic temporary dir, created using
mkdtemp(). Also, the sstable tests will now run in /tmp, which makes them
much faster.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180516042044.15336-1-raphaelsc@scylladb.com>
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).
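(A back-of-the-envelope check of those numbers, not taken from the
patch: compacting n segments at occupancy f leaves about n*f segments of
live data, freeing n*(1-f) segments. To free one segment we need
n >= 1/(1-f), which for f = 0.85 gives n >= 6.67, i.e. 7 segments, of
which roughly 6 segments' worth of data (7 * 0.85 ~= 6) must be
migrated.)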
This patch reduces the amount of segment compaction in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, just eviction.
In the perf_row_cache_update test case for a large partition with lots of
rows, which simulates an appending workload, I measured that, before the
patch, for each new object allocated 2 needed to be migrated. After the
patch, only 0.003 objects are migrated. This reduces the run time of the
cache update part by 50%.
"
The memory estimations we have when using the chunked vector
are usually slightly wrong. We can make them more accurate by
exporting the memory usage directly as a chunked_vector API.
"
* 'chunked_memory-v2' of github.com:glommer/scylla:
large_bitset: be more accurate with memory usage
chunked_vector: exports its current memory usage
We are slightly underestimating the amount of memory we use. Now that
the chunked vector can export its internal memory usage we can use that
directly.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we would like to estimate how much memory
a chunked_vector is using. We have two strategies to do it:
1) multiply the size by the size of the elements. That is wrong, because
the chunked_vector can allocate larger chunks in anticipation of more
elements to come.
2) multiply the number of chunks by 128kB. That is also wrong, because
the chunked_vector will not always allocate the entire chunk if there are
only a few elements in it.
The best way to deal with it is to allow the chunked_vector to export
its current memory usage.
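A minimal sketch of the exported-usage idea (the chunk layout below is a
stand-in, not the real chunked_vector internals):

    #include <cstddef>
    #include <vector>

    // The container reports the memory it actually reserved, chunk by
    // chunk, instead of callers guessing from size() or from a fixed
    // 128kB chunk size.
    template <typename T>
    class chunked_vector_like {
        std::vector<std::vector<T>> _chunks; // stand-in for the chunk list
    public:
        size_t memory_usage() const {
            size_t bytes = sizeof(*this);
            for (const auto& chunk : _chunks) {
                bytes += chunk.capacity() * sizeof(T); // reserved, not just used
            }
            return bytes;
        }
    };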
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Commit 9eb8ea8b11 installed
scylla_blocktune.py as part of preparing the rpm, but forgot
to add it to the installed file list, breaking the rpm build.
Fix by listing the file in the %files section.
Message-Id: <20180506202807.5719-1-avi@scylladb.com>
"
SSTables 3.x (format 'm') use CRC32 instead of Adler32 for calculating
checksums. This patchset introduces support for CRC32 along with Adler32
in checksummed_file_writer to be used for SSTables written in 'mc'
format.
Structures and helpers introduced for CRC32 will be later used for
calculating checksums for compressed files as well (not a part of this
patchset).
Tests: unit {release}
"
* 'projects/sstables-30/write-digest-crc/v3' of https://github.com/argenet/scylla:
tests: Add test covering checksumming SSTables 3.0 with CRC32.
sstables: Support CRC32 checksum for SSTables 3.x.
sstables: Move adler32 routines under the scope of a class.
sstables: Move checksum utils into separate header.
sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
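For illustration only: both digests are available from zlib, so a writer
can be parameterized on the checksum routine per SSTable format (the
wrapper names below are hypothetical):

    #include <zlib.h>
    #include <cstdint>

    // CRC32 for SSTables 3.x ('mc'); Adler32 for the older formats.
    uint32_t checksum_crc32(const unsigned char* buf, uint32_t len) {
        return ::crc32(::crc32(0L, Z_NULL, 0), buf, len);
    }

    uint32_t checksum_adler32(const unsigned char* buf, uint32_t len) {
        return ::adler32(::adler32(0L, Z_NULL, 0), buf, len);
    }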
"
The native protocol server generates many reactor tasks that
can be easily eliminated. I measured a read workload with 100%
cache hit rate, seeing the number of tasks per request drop
from ~31 to ~27, and an increase of 3% in throughput.
"
* tag 'transport-optimize-1/v1' of https://github.com/avikivity/scylla:
transport: remove unused capture of flags variable
transport: merge response write and error handling continuations
transport: make write_response() return void
transport: de-template a lambda
transport: merge memory-management and logging continuations
transport: remove gate continuation
transport: merge two response processing continuations
transport: simplify response processing continuation
transport: remove gratuitous continuation from process_request_one()
Remove implicit timeouts and replace with caller-specified timeouts.
This allows removing the ambiguity about what timeout a statement is
executed with, and allows removing cql_statement::execute_internal(),
which mostly overrode timeouts and consistency levels.
Timeout selection is now as follows:
query_processor::*_internal: infinite timeout, CL=ONE
query_processor::process(), execute(): user-specified consistency level and timeout
All callers were adjusted to specify an infinite timeout. This can be
further adjusted later to use the "other" timeout for DCL and the
read or write timeout (as needed) for authentication in the normal
query path.
Note that infinite timeouts don't mean that the query will hang; as
soon as the failure detector decides that the node is down, RPC
responses will terminate with a failure and the query will fail.
Make the consistency level explicit in the caller in order to clarify
what is going on.
An "internal" query used to mean that it was accessing local tables,
so infinite timeouts and a consistency level of ONE were indicated,
but authentication accesses non-local tables so explicit consistency
level and timeouts are needed.
It just schedules the response, and returns immediately.
(I thought about calling it schedule_response(), but usually it will
write the response immediately, since waiting for network writes is
rare in a local network).
with_gate() generates a continuation if the protected function defers.
Avoid that by merging a gate::leave() call with another, preexisting,
continuation.
We have one continuation transforming the result, and another shutting
down tracing. Since the first cannot defer, we can merge the two, reducing
the number of tasks processed by the reactor.
This patch fixes a bug where queries using a secondary index would, in
some cases, produce the same rows multiple times.
The problem was that the code begins by finding a list of primary keys
that match the search, and then works on the partitions containing them.
If multiple rows matched in the same partition, the partition was considered
multiple times, and the same rows were output multiple times.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180510203141.17157-1-nyh@scylladb.com>
If we're given a single reader (can be common in a low-write-rate table,
where most of the data will be in a single large sstable, or in leveled
tables) then we can avoid the overhead of the combining reader by returning
the single input.
Tests: unit (release)
Message-Id: <20180513130333.15424-1-avi@scylladb.com>
"
This patchset implements separate timeouts for range queries, and lays
the foundations for separate timeouts for other query types.
While the feature in itself is worthy, the real motivation is to have
the timeouts decided by the caller, instead of storage_proxy. This in
turn is required to disentangle each layer behaving differently
depending on whether the query is internal or not; instead, the goal
is to have each caller declare its needs in terms of consistency level
and timeouts, and have the lower layers implement its requirements
instead of making their own decisions.
Fixes#3013.
Tests: unit (release)
"
* tag '3013/v1.1' of https://github.com/avikivity/scylla:
storage_proxy: remove default_query_timeout()
storage_proxy: don't use default timeouts
query_options: augment with timeout_config
thrift: configure thrift transport and handler with a timeout_config
transport: configure native transport with a timeout_config
cql3: define and populate timeout_config_selector
timeout_config: introduce timeout configuration
Before we accept running while not in developer mode, we verify that
the I/O Scheduler is properly configured. Up until now, that meant
verifying that --max-io-requests is properly set and that the number
of I/O Queues is enough to leave at least 4 requests per I/O Queue.
Systems that move to newer versions of Scylla may continue doing that,
so we need to be backwards compatible and keep testing for that.
However, newer systems will not set that option, but pass a YAML
property file (or string) instead. So we need to make sure that
either one of those is set.
If the property file is set, I am deciding here not to test for
number of I/O queues. scylla_io_setup will usually configure that
anyway, plus we plan on soon moving to all-shards-dispatch making
that less important.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180509163737.5907-1-glauber@scylladb.com>
There is an ongoing discussion in issue 2678 about the right time to
release permits. Right now we are releasing the permit after we flush
all data for the memtable plus the SSTables accompanying components -
plus flushing them, closing them, etc.
During all that time, we are increasing virtual dirty by adding more
data to the buffers but we are not able to decrease it-- until we
release the permit we can't start flushing the next memtable. This is
much more of a concern than I/O overlapping as described in the issue.
We have a hook in the SSTable write process that is (should be) called
as soon as data is written. We should move the permit release there.
We aren't, though, calling that as early as we could. The call to the
data written hook is writing after the Index is closed, summary is
sealed, etc.
This patch fixes that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-2-glauber@scylladb.com>
Currently reserve refill allocates segments repeatedly until the
reserve threshold is met. If a single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we are trying to allocate. In particular, it would not attempt
to compact any segment until it has first evicted the total amount of
memory, which may reduce the total amount of segment compaction during
refill.
This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
We have a conflict between scylla-libgcc72/scylla-libstdc++72 and
scylla-libgcc73/scylla-libstdc++73; we need to replace the *72 packages
with the scylla-2.2 metapackage to prevent it.
Fixes#3373
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180510081246.17928-1-syuu@scylladb.com>
"Previously, partition tombstone was not written for partitions with no
rows causing corrupted data files.
This is now fixed and covered with tests.
In addition, we now track partition tombstones while collecting encoding
statistics."
* 'projects/sstables-30/fix-partition-tombstone/v3' of https://github.com/argenet/scylla:
tests: Don't use deprecated schema constructor.
tests: Add tests to cover partitions consisting only of partition keys.
sstables: Make sure partition level tombstone is written for partitions with no rows.
memtable: Collect statistics from partition-level tombstone.
"
This is preparatory cleanup series with fixes/cleanup of miscellaneous
issues that I discovered while working on the stateful range-scans.
Since the stateful range-scans series, even without these patches, is a
20+ patches strong series I'd like to fast-track this, to ease reviewing
the former.
Most of the changes here are related to code hygiene and effectiveness,
and there is a patch that is correctness-related ("querier: check only
the end bound of ranges when matching them") and one that is related to
ease-of-use ("range: clean the deduced transformed type").
Note that although these changes were made in the context of working on
the stateful range-scans they make sense on their own as well.
Tests: unit(release, debug)
"
* '1865/pre-range-scans-cleanup/v1' of https://github.com/denesb/scylla:
multishard_combining_reader: use optimized optional for the shard reader
Use dht::token_range alias for last/preferred replicas
storage_proxy::coordinator_query_result: merge constructors into one w/ default params
querier: check only the end bound of ranges when matching them
querier: take range and slice by value
querier: remove const params from make_compaction_state()
querier: make _range and _slice const
flat_multi_range_mutation_reader: optimize for non-plural range vectors
range: clean the deduced transformed type
"
Fixes#3420.
Tests: dtest (`auth_test.py`), unit (release)
"
* 'jhk/fix_3420/v2' of https://github.com/hakuch/scylla:
cql3: Include custom options in LIST ROLES
auth: Query custom options from the `authenticator`
auth: Add type alias for custom auth. options
"
These patches were extracted from a much larger series that introduces a new
in-memory representation of cells. They contain various enhancements and
fixes that, to a varying degree, make sense on their own. Sending them
separately will hopefully ease the review and merging process of the whole
IMR effort.
Tests: unit(release).
"
* tag 'pre-imr/v1' of https://github.com/pdziepak/scylla:
tests/perf: add microbenchmarks for basic row operations
tests: simple_schema: add make_row_from_serialized_value()
row: add clear_hash()
types: move compare_unsigned() to bytes.hh
lsa: provide migrator with the object size
lsa: add free() that does not require object size
db/view/build_progress: avoid copying mutation fragment
mutation_partition: enable ADL for cell swap
types: make some collection_type_impl functions non-static
counters: drop revertability of apply()
mutable_view: add default constructor and const_iterator
tests/mutation_reader: do not apply mutations created on another shard
sstables: do not call atomic_cell::value() for dead cells
lsa: sanitize use of migrators
lsa: reuse registered migrator ids
lsa: make migrators table thread-local
The querier provides a `matches(const nonwrapping_range&)` member to
allow for checking whether a range matches that with which the querier
was originally created. The check for a match is more lax than a strict
equality check, as ranges are shrunk as the query progresses.
Because of this, the above member only checked that one of the bounds of
the examined ranges matches. This is adequate for this purpose
because, in the context of a single query, it is guaranteed that no
two read requests to the same replica will have overlapping ranges.
However, Avi pointed out in a recent, related review that this check can
be made a little stricter by requiring that the end-bounds of the
two ranges *always* match, instead of allowing either of the bounds to
match.
Don't create a flat_multi_range_mutation_reader when the range vector
has 0 or 1 element. In the former case create an empty reader and in the
latter just create a reader from the mutation source with the only range
in the vector.
wrapping_range and nonwrapping_range offer a transform() member function
which allows creating a new range by applying a transformer function to
the bounds of the current range. The type of bounds of the new range is
deduced from the return type of this transformer function. However the
return type is used as-is, with any CV or reference qualifiers attached
to it. Since it doesn't make sense to create a range of references or of
cv-qualified types, strip these off the deduced type.
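A minimal sketch of the deduction fix (alias name hypothetical):

    #include <type_traits>

    // Strip references and cv-qualifiers from the transformer's return
    // type before using it as the bound type of the new range.
    template <typename Transformer, typename Bound>
    using transformed_bound_t =
        std::remove_cv_t<std::remove_reference_t<
            std::invoke_result_t<Transformer, Bound>>>;

    // A transformer returning `const int&` yields a range of plain `int`.
    static_assert(std::is_same_v<
        transformed_bound_t<const int& (*)(long), long>, int>);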
An implementation of `authenticator` can support custom options for
each role.
If, to make up an example, the authenticator supported the `region` key,
then a role would be created as follows:
CREATE ROLE jsmith WITH OPTIONS = { 'region': 'north_america' }
AND PASSWORD = 'super_secure';
LIST ROLES will now print this custom option map as an additional column
with the heading "options".
However, none of the implementations of `authenticator` in Scylla
currently support OPTIONS, so LIST ROLES will in practice, for now,
print the empty set:
role | super | login | options
-----------+-------+-------+---------
cassandra | True | True | {}
None of the `authenticator` implementations we have support custom
options, but we should support this operation to support the relevant
CQL statements.
simple_schema::make_row() is not very well suited for performance tests
of row and cell creation since it serialises the value. This patch
introduces a new function that performs only minimal actions.
compare_unsigned() is a general utility function that compares two
bytes_views byte-by-byte. There is no need to include the whole types.hh
in order to make it available.
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the information to the migrator we can avoid some needless
operations.
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
Calling fully qualified std::swap() prohibits the cell objects from
using their own swap implementations. This patch invokes swap() in
the usual ADL-friendly way.
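The idiom in question, for reference:

    #include <utility>

    // Bring std::swap into scope as a fallback, then call swap()
    // unqualified so ADL can pick a type's own overload if one exists.
    template <typename T>
    void swap_values(T& a, T& b) {
        using std::swap;
        swap(a, b); // T's own swap via ADL, or std::swap otherwise
    }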
The switch to the new in-memory representation will require larger
parts of the logic to be aware of the types of the values they are
dealing with. In most cases it is not a significant burden for the users.
Scylla uses a shared-nothing architecture and communication between the
shards is supposed to be very restricted. Applying mutations created on
another shard to a memtable is far too complex an operation to be
allowed. Using frozen mutations is a much safer option.
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in debug mode
with the hope of catching any misuse early.
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound on how many
times they can be created and destroyed, it is better to be safe and
reuse registered migrator ids.
"
SSTable 3.0 format introduces a serialization header which is used when reading SSTables in that format.
This patchset implements loading of this new component of Statistics.db.
Tests: units (release)
"
* 'haaawk/sstables3/load_serialization_header_v2' of ssh://github.com/scylladb/seastar-dev:
Load serialization_header from statistics
Add parse for disk_array_vint_size
Add helpers to read/parse vints
Add signed_vint::serialized_size_from_first_byte
Add sstable::get_serialization_header
Move random_access_reader to separate header
250ms is too long a period for the memtable controller. Since memtable
flushes are relatively efficient, especially in comparison to
compactions, if the shares are high we can flush a lot of data down with
the high shares - so in the next adjustment period our shares will be
minuscule and we won't flush much at all.
This leads to oscillating behavior that is mitigated by adjusting
faster.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-3-glauber@scylladb.com>
"Fixes a bug in partition_snapshot::merge_partition_versions(), which would not
attempt merging if the snapshot is attached to the latest version (in which
case _version is nullptr and _entry is != nullptr). This would cause
partition_version objects to accumulate if there was an older snapshot and it
went away before the latest snapshot. Versions will be removed when the whole
entry goes away (flush or eviction).
May cause performance problems.
Fixes #3402."
* 'tgrabiec/fix-merge_partition_versions' of github.com:tgrabiec/scylla:
mvcc: Test version merging when snapshots go away
anchorless_list: Make ranges conform to SinglePassRange
anchorless_list: Drop deprecated use of std::iterator
mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
The newer version of iotune, recently merged to Seastar, accepts
a new parameter that tells us where we should store the properties
of the disk.
We are already generating that properties file for the AMI case.
Let's also pass that parameter when calling iotune.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180507175757.9144-1-glauber@scylladb.com>
Starting from 2018.1 and 2.2 there was a change in the repository path.
It was made to support multiple products (like manager) and to place the
enterprise version in a different path.
As a result, the regular expression that looks for the repository fails.
This patch changes the way the path is searched: both rpm and debian
variations are combined, and both options of the repository path are
unified.
See scylladb/scylla-enterprise#527
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180429151926.20431-1-amnon@scylladb.com>
"
In SSTables 3.0, the base and increment fields have been swapped in
Bloom filters to reduce collisions (see CASSANDRA-8413). This affects
the resulting values written to Filter.db.
This patchset adds support for reading/writing Filter.db in the format
corresponding to the version of SSTables.
Tests: unit {release}
Filter.db files have been generated using Cassandra 3.11 with the same data
as in unit tests and are validated to match those generated by Scylla.
"
* 'projects/sstables-30/write-filter/v1-2' of https://github.com/argenet/scylla:
Fix mistakes and typos in comments (minor clean-up)
Check Filter.db in SSTables 3.x write tests.
Support Bloom filter format used in SSTables 3.0.
Remove unused overload of i_filter::get_filter().
The two hash values, base and increment, used to produce indices for
setting bits in the filter, have been swapped in SSTables 3.0.
See CASSANDRA-8413 for details.
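A hedged sketch of the double-hashing scheme involved (index arithmetic
only, with assumed roles for the two hashes; the real hash and modulo
handling live in the filter code):

    #include <cstdint>

    // Each of the k probes derives its bit index from a "base" hash
    // advanced by an "increment" hash. Swapping which hash plays which
    // role changes every index produced, hence the format difference.
    uint64_t probe_index(uint64_t base, uint64_t increment,
                         unsigned i, uint64_t num_bits) {
        return (base + i * increment) % num_bits;
    }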
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Fixes crash in cql_tests.StorageProxyCQLTester.table_test
"avoid race condition when deleting sstable on behalf..." changed
discard_sstables behaviour to only return rp:s for sstables owned
and submitted for deletion (not all matching time stamp),
which can in some cases cause zero rp returned.
Message-Id: <20180508070003.1110-1-calle@scylladb.com>
When a node is decommissioned/removed it will drain all its hints, and all
remote nodes that have hints for it will drain their hints to this node.
What does "drain" mean? The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move on to the next one till there are
no more segments left.
After all hints are drained the corresponding hints directory is removed.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Returning a future with an exception from end_point_manager::stop()
is practically useless: the best the caller can do is log
it and continue as if it didn't happen, because it has other things
to shut down.
Therefore in order to simplify the caller we will log the exception
if it happens and will always return a non-exceptional future.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This ensures we respect the write timeout set by the client when
applying base writes, in case a write takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
* seastar ac02df7...840002c (20):
> dpdk: protect against missing statistics
> alien: make visible in documentation
> Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
> app_template: Correct outdated comment
> apps, tests: Catch polymorphic exceptions by reference
> configure.py: Enhance detection for gcc -fvisibility=hidden bug
> reactor: add rudimentary task histogram reporting
> Revert "Merge "rewrite iotune to conform to the new ioscheduler" from Glauber"
> Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
> build: Use the same warning name for Clang and GCC
> core/rwlock: Add support for timeouts
> fs qualification: protect against EINTR
> Docker: Fix failing build due to missing GNU make
> reactor: move optional to experimental so we compile with c++14
> future: remove allocation from future::get() thread context switch
> Merge "rpc streaming" from Gleb
> reactor: put mountpoint_params in seastar namespace
> Tutorial: in PDF version of tutorial, better backtick typesetting
> tutorial: support, and start using, links to other sections
> tutorial: improve second half of semaphores section
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit to 100 the
number of outstanding view updates. We limit globally per shard, and
not per destination view replica. We also limit statically.
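A minimal sketch of that throttle with a Seastar semaphore (the limit of
100 comes from the description above; the surrounding names are
hypothetical):

    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>

    // One per-shard semaphore shared by all outgoing view updates.
    static thread_local seastar::semaphore view_update_sem{100};

    seastar::future<> send_view_update(/* mutation, target replica, ... */) {
        // Waits if 100 updates are already in flight on this shard.
        return seastar::with_semaphore(view_update_sem, 1, [] {
            return seastar::make_ready_future<>(); // stand-in for the RPC
        });
    }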
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
The functions in cql_assertions.hh are very convenient, but have one
frustrating drawback: When you have many of those assertions in one
test, it's very hard to know *which* of the similar assertions failed.
The problem is that an error often looks like this:
unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/cql_assertions.cc(131): last checkpoint
Which of the many similar checks in "test_many_columns" failed? Note the
unhelpful "unknown location" and also the "last checkpoint" points to code
in cql_assertions.cc, not in the actual test, so it is useless.
The root cause of these problems is that the Boost macros use the C
preprocessor __FILE__ and __LINE__, which, inside actual C++ functions like
is_rows(), record that function's location instead of the caller's. Fixing
this will not be simple. But this patch has a much simpler solution - fixing
the "last checkpoint". What ruins the last checkpoint is the use of BOOST_REQUIRE
inside the cql_assertions.cc is_rows() - when that succeeds, it records
the location inside cql_assertions.cc (!) as the last success.
If we just replace BOOST_REQUIRE by our own test (just like in the rest of
the cql_assertions.cc code), this code will not override the last checkpoint.
The user can see the last real successful BOOST_REQUIRE, or use
BOOST_TEST_PASSPOINT() to set their own checkpoints between different parts of
the same test.
After this patch, and with adding BOOST_TEST_PASSPOINT() calls between
different parts of my test, the failure above now looks like:
unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/secondary_index_test.cc(299): last checkpoint
The "last checkpoint" now shows me exactly where my failing check was.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501152638.26238-1-nyh@scylladb.com>
"This patchset adds ostream operators to result_message and uses them
in cql_assertions."
* tag 'result_message-print/v1.1' of https://github.com/avikivity/scylla:
tests: cql_assertions: improve error message when a row is not found
transport: add ostream support to result_message
transport: const correctness for result_message::accept()
Use utf8_type where warranted.
Fixes view_schema_test failure where the rows did not match. I don't
understand exactly why the failure happened (using the wrong type
should not cause a failure here), but the change fixes the problem.
Tests: view_schema_test (release)
Message-Id: <20180506130015.7450-1-avi@scylladb.com>
The visitor does not alter the result_message it is visiting (and
its signature indicates that) so accept() should be const-qualified
to indicate that and to allow visiting const result_message:s.
"
This patchset adds support for writing Statistics.db in the SSTables
'mc' (3.x) format. This file is essential for reading data stored in
Data.db as it contains base values used for delta encoding and types of
columns.
This patchset also fixes several bugs found in writing data and index
files as well as bugs in a statistics-related structure definition.
Tests: unit {debug, release}
All SSTables files for write unit tests are validated to be processed by
sstabledump and output is verified to show the expected data.
"
* 'projects/sstables-30/write-statistics/v1' of https://github.com/argenet/scylla:
Add test covering the composite partition key case.
Add Statistics.db files to write tests for SSTables 3.0.
Do not check rows and cells for expiration when writing them to the data file.
Fix promoted index serialization.
Fix the order of items in stats_metadata.
Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
Write serialization header to Statistics.db for SSTables 3.x.
Do not pass schema to metadata_collector::update(column_stats)
Collect metadata statistics when writing SSTables 3.0.
Call get_metadata_collector() instead of referencing sstable::_collector directly.
Fix logic of writing TTLed cells in SSTable 3.0 format.
Separate statistics for count of cells, columns and rows in column_stats.
Deserialize collection in a way that doesn't incur shared_ptr counter increment and is generally shorter.
Track both min & max values for timestamp, TTL and local deletion time in metadata_collector.
Add class for tracking both extremum values (min and max) on updates.
Mainly to check that the composite type is properly serialized when
writing serialization header to Statistics.db.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For these tests to work, all time-related values are now fixed as these
are stored in Statistics.db files.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Although this logic may be seen as a useful optimization, it hinders
unit tests writing SSTables 3.0 as those need to have fixed time-related
values to produce Statistics.db files with the same content on each run.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
There is a new field introduced in the SSTables 3.0 index file format
named 'partition_header_length' that can be used to skip over to the
first clustering row in a wide partition. This one has not been
previously written and caused malformed indices.
Updated the corresponding test to include a static row and write
multiple wide partitions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The serialization header is a new component in Statistics.db introduced in
the SSTables 3.0 ('ma') format. It is essential for reading the data file as it
contains the base values used for delta-encoded values (timestamps,
TTLs, local deletion times) and description of column types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
After reboot, all existing sstables are considered shared. That's a safe default.
Reader used by compaction decides to use filtering reader (filters out data that
doesn't belong to this shard) if sstable is considered shared even though it may
actually be unshared.
By avoiding filtering reader we're avoiding an extra check for each key, and that
may be meaningful for compaction of tons of small partitions and even range
reads of such. We do so by fixing sstable::_shared, which is now set properly for
existing sstables at start.
quick check using microbenchmark which extends perf_sstable with compaction mode:
before: 69407.61 +- 37.03 partitions / sec (30 runs, 1 concurrent ops)
after: 70161.09 +- 40.35 partitions / sec (30 runs, 1 concurrent ops)
Fixes#3042.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504182158.21130-1-raphaelsc@scylladb.com>
"
This series introduces a system.large_partitions table,
used to gather information on the largest partitions in the cluster.
The schema below allows easy extraction of the most offending keys and removal
by sstable name, which happens when a table is compacted away.
Schema: (
keyspace_name text,
table_name text,
sstable_name text,
partition_size bigint,
key text,
compaction_time timestamp,
PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);
"
Closes#3292.
* 'large_partition_table_3' of https://github.com/psarna/scylla:
database, sstables, tests: add large_partition_handler
db: add large_partition_handler interface with implementations
docs: init system_keyspace entry with system.large_partitions
db: add system.large_partitions table
This commit makes the database, sstables and tests aware
of which large_partition_handler they use.
The proper large_partition_handler is retrievable from config information
and is based on the existing compaction_large_partition_warning_threshold_mb
entry. Right now the CQL TABLE variant of large_partition_handler is used
in the database.
Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
This commit introduces large_partition_handler class, which can be used
to take additional action when large partitions are written.
It comes with two implementations:
* NOP, used in tests, which does nothing on large partition
update/delete
* CQL TABLE, which inserts/deletes information on particular sstable
to system.large_partitions table, in order to be retrievable from
cqlsh later.
References #3292
This commit adds a system.large_partitions table, which can be used
to trace the largest partitions of a cluster.
Schema: (
keyspace_name text,
table_name text,
sstable_name text,
partition_size bigint,
key text,
compaction_time timestamp,
PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);
References #3292
After removal of the deletion manager, the caller is now responsible for
properly submitting the deletion of a shared sstable. That's because the
deletion manager was responsible for holding the deletion until all owners
agreed on it.
Resharding, for example, was changed to delete the shared sstables at the end,
but truncate wasn't changed, so a race condition could happen when deleting the
same sstable on more than one shard in parallel. Change the operation to
submit a shared sstable for deletion on only one owner.
Fixes dtest migration_test.TestMigration.migrate_sstable_with_schema_change_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180503193427.24049-1-raphaelsc@scylladb.com>
A step to untie classes sstable_writer_m and sstable so that eventually
we could stop them being friends.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
SSTables 3.0 format makes a distinction between count of cells and count
of columns. In that sense, a column of a collection type counts as one
column but every atomic cell in it counts as a separate cell.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
test_multishard_combining_reader_destroyed_with_pending_create_reader
was failing because it relied on smp == 3, and thus on the shard on which
the reader creation is blocked being shard 2. Since the test requires
being run with smp >= 3, we can hardcode this shard to 2, because if the
test runs at all we are guaranteed to have smp >= 3.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <38883a1f4c18ca0cd065aa13826a4f1858353289.1525328233.git.bdenes@scylladb.com>
These tests are quite complicated and require intimate knowledge of how
foreign_reader and multishard_combining_reader operate. Knowing these
two objects is still required to understand the tests, but this patch makes
it that much easier by explaining how they were designed to test what they test.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8de580131a8652924de920c2bc68a98e579398ee.1525328226.git.bdenes@scylladb.com>
'shard' is a short-lived on-stack variable that gets captured by
reference by continuation that gets executed on another shard.
Fixes a race condition that leads to an heap-use-after-free.
Message-Id: <20180502150507.2776-1-pdziepak@scylladb.com>
The test_foreign_reader_destroyed_with_pending_read_ahead test currently
doesn't ensure that the objects in its scope are destroyed in the
correct order. This is necessary as there are several foreign pointers
to objects that live on remote shards and use each other. Since
foreign pointers destroy their managed object in the background we
cannot rely on them to reliably destroy objects in order, nor can we be
sure when the object they manage is actually destroyed.
So to work around that, ensure that the puppet_reader is destroyed before
the remote_control it references even has a chance of being destroyed.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <232eaa899878b03fb2a765c2916e4f05841472a3.1525269726.git.bdenes@scylladb.com>
Test for Scylla's default choice of secondary index name (we found one
small problem, see issue #3403, and left it commented out). Also test
the ability to give indices non-default names.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501153439.26619-1-nyh@scylladb.com>
Add a test that adding a secondary-index for an only partition key column
is not allowed (it would be redundant), but indexing one of several partition
key columns *is* allowed. This reproduced issue #3404, and verifies that
it was fixed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-2-nyh@scylladb.com>
Indexing an only partition key component is not allowed (because it would
be redundant), but it should be allowed to index one of several partition
key components. We had a bug in that case: the underlying materialized view
we created had the same column as both a partition key and a clustering
key, which resulted in an assertion failure. This patch fixes that.
Fixes#3404.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-1-nyh@scylladb.com>
The db/index directory contains just a few lines of code that exists
there for historical reasons. It's confusing that we have both db/index
and index/ directory related to secondary-indexing.
This patch moves what little is still in db/index/ to index/. In the
future we should probably get rid of the "secondary_index" class we had
there, but for now, let's at least not have a whole new directory for it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501101246.21143-1-nyh@scylladb.com>
Fixes a bug in partition_snapshot::merge_partition_versions(), which
would not attempt merging if the snapshot is attached to the latest
version (in which case _version is nullptr and _entry != nullptr).
This would cause partition_version objects to accumulate if
there was an older snapshot and it went away before the latest
snapshot. Versions will be removed when the whole entry goes away
(flush or eviction).
May have caused performance problems.
Fixes #3402.
"
Both multishard_combining_reader and foreign_reader use read-ahead in the
background to avoid blocking consumers. These read-aheads can still be
pending when the reader is destroyed and hence extra attention is needed
to avoid memory errors. Recent manual testing, done in the context of
testing code that is using the multishard reader, proved that these
cases were not handled correctly in the initial series introducing it
(2d126a79b).
This series introduces fixes and comprehensive tests for all problematic
scenarios:
1) multishard_combining_reader is destroyed with pending reader creation
on a remote shard.
2) foreign_reader is destroyed with pending read-ahead.
3) multishard_combining_reader is destroyed with pending read-ahead.
"
* 'multishard-reader-read-ahead-fixes/v2' of https://github.com/denesb/scylla:
test.py: add custom seastar flags for mutation_reader_test
test.py: move custom seastar flags for tests declarative
mutation_reader_test: add read-ahead related multishard reader tests
tests/mutation_reader_test: change recommended smp to 3
mutation_reader_test: fix name of existing multishard reader tests
simple_schema: add global_simple_schema
simple_schema.hh: remove unused include
multishard_combining_reader: prepare for read-ahead outliving the reader
foreign_reader: prepare for read-ahead outliving the reader
multishard_combining_reader: avoid creating the shard reader twice
multishard_combining_reader: read_ahead: don't assume reader is created
multishard_combining_reader: move read-ahead related methods
multishard_combining_reader: avoid looking up the shard reader twice
multishard_combining_reader: use optional for maybe created reader
Add tests for foreign_reader and multishard_combining_reader that check
that readers destroyed while there is a pending read-ahead will not result
in use-after-free.
Specifically check that:
* multishard_combining_reader destroyed with pending reader creation
* foreign_reader destroyed with pending read-ahead
* multishard_combining_reader destroyed with pending read-ahead
does not result in use-after-free or SEGFAULT.
These tests try to do their best to check for correct behaviour with
various BOOST_REQUIRE* checks but they still heavily rely on ASAN to
detect any use-after-free, SEGFAULT or similar errors.
Of the test_multishard_combining_reader_reading_empty_table test.
Running this test with smp=3 instead of smp=2 helps detect additional
read-ahead related memory problems.
Which allows a simple_schema instance to be transferred to another
shard. In fact a new simple_schema instance will be created on the
remote shard, but it will use the same schema instance as the original
one.
When the multishard reader is destroyed there might be several pending
read-aheads running in the background. These read-aheads need their
associated reader to stay alive until after the read-ahead completes.
To solve this move the flat_mutation_reader into a struct and manage
this struct's lifetime through a shared pointer. Fibers associated with
read-aheads that might outlive the multishard reader will hold on to a
copy of the shared pointer, keeping the underlying reader alive until they
complete. To avoid doing any extra work, a flag is added to this state
which is set when the multishard reader is destroyed. When this flag is
set, pending continuations will return early. All this is encapsulated
in multishard_combining_reader::shard_reader, so the multishard reader
code itself need not be changed.
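A condensed sketch of that scheme; member names are illustrative, not the
actual Scylla code:

    // the shared state that may outlive the multishard reader itself
    struct reader_state {
        flat_mutation_reader reader;
        bool stopped = false;               // set when the owner goes away
    };

    struct shard_reader {
        lw_shared_ptr<reader_state> _state;

        future<> read_ahead() {
            auto state = _state;            // the fiber keeps its own copy
            return state->reader.fill_buffer().then([state] {
                if (state->stopped) {
                    return;                 // owner is gone: skip all work
                }
                // ... hand the filled buffer over to the merging logic ...
            });
        }

        ~shard_reader() {
            _state->stopped = true;         // pending fibers return early
        }
    };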
The foreign reader keeps track of ongoing read-aheads via a
foreign_ptr to the read-ahead's future on the remote shard. This pointer
is overwritten after each "remote call" to the remote reader with a
pointer to the new read-ahead's future.
There are several problems with the current implementation:
1) There is a new read-ahead launched after each "remote call"
unconditionally, even if the remote reader is at EOS. This will start an
unnecessary read-ahead when the reader is already finished and may be
soon destroyed (legally) by the client.
2) The pointer to the remote read-ahead future is not set to nullptr
when a remote call is issued. Thus in the destructor, where we
attach a continuation to the read-ahead's future to extend the
reader's lifetime until after the read-ahead finishes, we might attach
a continuation to a future that already has one and run into a failed
assert().
To fix these issues, reset the read-ahead pointer to nullptr each time a
remote call is issued and don't start a new read-ahead if the remote
reader is at EOS. This way we can ensure that when the reader is
destroyed we either have a valid and non-stale read-ahead future or none
at all and can reliably make a decision about whether we need to extend
the lifetime of the remote reader or not.
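A sketch of the resulting invariant (hypothetical member names): the
read-ahead slot is always either empty or holds a live, unconsumed future.

    #include <optional>
    #include <seastar/core/future.hh>

    using namespace seastar;

    struct read_ahead_slot {
        std::optional<future<>> _read_ahead;    // at most one in flight

        // consume the pending read-ahead, leaving the slot empty so that
        // nobody can chain a second continuation onto the same future
        future<> take() {
            if (!_read_ahead) {
                return make_ready_future<>();
            }
            auto f = std::move(*_read_ahead);
            _read_ahead = std::nullopt;         // reset on every remote call
            return f;
        }
    };
    // a new read-ahead is armed only while the remote reader is not at
    // end-of-stream, so a finished reader carries no stale futures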
The multishard reader creates its shard readers on demand, when they
are first used. However, at that time the reader might already be in the
process of being created, initiated by a previous read-ahead.
To avoid creating the shard reader twice, check whether there are any
read-aheads in progress before creating the reader. If there are, the
read-ahead has already created (is creating, or will create) the reader,
so synchronise with it instead. Synchronisation happens via a promise:
the read-ahead creates a promise which will be fulfilled when the reader
is created, and a concurrent create_reader() call will wait on this
promise instead of attempting to create a new reader.
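A sketch of that synchronisation, with hypothetical members (a seastar
shared_promise stands in for the promise, so multiple waiters are also
safe; make_reader() is an assumed factory):

    struct shard_reader_sketch {
        std::optional<flat_mutation_reader> _reader;
        std::optional<shared_promise<>> _creation;   // set by the creator

        future<> get_or_create_reader() {
            if (_reader) {
                return make_ready_future<>();        // already created
            }
            if (_creation) {                         // a read-ahead got
                return _creation->get_shared_future(); // here first: wait
            }
            _creation.emplace();                     // claim the creation
            return make_reader().then([this] (flat_mutation_reader r) {
                _reader.emplace(std::move(r));
                _creation->set_value();              // release any waiters
            });
        }

        future<flat_mutation_reader> make_reader(); // hypothetical factory
    };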
Currently it is assumed that when read_ahead is called the reader is
already created. Under most circumstances this will not be true. It was
blind (bad) luck that we didn't hit this before (during testing).
Move them to the group of methods that do not assume the reader is
already created. A patch will follow that updates read_ahead() to not
assume that the reader is created.
After a little "research" [1] it turns out my initial fears were
completely groundless: std::optional::operator->() and
std::optional::operator*() don't involve an unnecessary branch and
thus there is no need to hand-roll an optional with a separate bool.
[1] http://en.cppreference.com/w/cpp/utility/optional/operator*
Determine which timeout we need to apply at prepare time. We
don't know the numerical value (since it depends on whoever is
executing the query, not just the statement type), but we know
which member of timeout_config we need, so determine and remember
that.
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:
storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)
Fixes #3393.
Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
Confirm that issue #2991 is indeed fixed - creating a secondary index
with IF NOT EXISTS ignores an already existing index, and dropping with
IF EXISTS ignores a non-existent index.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180430071714.10154-1-nyh@scylladb.com>
The existing test_secondary_index_case_sensitive only tested the
case-sensitive case of the column being indexed, and only in some
scenarios. Further testing exposed more bugs - issue #3388, issue #3391,
issue #3401. This patch adds tests which reproduced those bugs, and now
verifies their fix.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-9-nyh@scylladb.com>
test_case_sensitivity from tests/view_schema_test.cc was well-intentioned,
aiming to test from different angles the issue of non-lowercase (quoted)
column names and their interaction with materialized views.
But unfortunately, it didn't test anything! This is because the quotation
marks were forgotten, so all the identifiers in this test were folded to
lowercase, and the test didn't test non-lowercase identifiers like it
intended.
So this patch adds the missing quotes, to make this test great again.
After the patches for issues #3388 and #3391 which I sent earlier, the
test *passes* (before those patches, the fixed test did not pass -
the unfixed test trivially passed).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-8-nyh@scylladb.com>
When the secondary index code builds a "%s IS NOT NULL" clause for a
CQL statement, it needs to quote the column name if quoting is needed
(i.e., the name contains anything other than lowercase letters, digits
and _).
Fixes #3401.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-7-nyh@scylladb.com>
We had another case-sensitivity bug in materialized views, where if
a case-sensitive (quoted) column name was listed explicitly on "SELECT"
(instead of implicitly, e.g., in "SELECT *") the column name was
incorrectly folded to lower-case and inserts would fail.
This patch fixes the code, where a "SELECT" statement was built using
the desired column names, but column names that needed quoting were
not being quoted. The bug was in a helper function build_select_statement()
which took column name strings and failed to quote them. We clean up this
function to take column definitions instead of strings - and take care
of the quoting itself. It also needs to quote the table's name in the
select statement being built.
Fixes #3391.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-6-nyh@scylladb.com>
Before this patch, if a materialized view is defined with the restriction
IS NOT NULL on a case-sensitive (quoted) column name, inserts fail with
a "restriction 'foobar IS NOT null' unknown column foobar" error, where
foobar is the lowercased version of the case-sensitive column name.
The problem is that the code uses single_column_relation::to_string()
to convert the relation into a CQL where clause. And indeed, this method
generates a CQL expression; But it calls column_identifier::raw::to_string()
to print identifiers. This is the wrong function - it doesn't quote
identifiers that need quoting because they are not lowercase.
So this patch uses column_identifier::raw::to_cql_string() (a method we
added in the previous patch) to generate the properly quoted CQL relation.
Fixes #3388
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-5-nyh@scylladb.com>
Implement a method column_identifier::raw::to_cql_string(). Exactly like
the one without "raw", this method quotes the identifier name as needed
for CQL. We'll need this method in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-4-nyh@scylladb.com>
There is no reason for to_cql_string() and maybe_quote() to both
implement the same quoting algorithm. Use the latter to implement the
former.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-3-nyh@scylladb.com>
The utility function maybe_quote() is supposed to quote identifier names
(name of keyspace, table, or column) according to CQL rules, e.g., if the
name has any uppercase or non-alphanumeric characters, it needs to be
quoted. Unfortunately, it didn't quite do the right thing, so this patch
fixes that. This patch also adds a comment explaining what maybe_quote()
is supposed to do (until now, users could only guess).
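A minimal sketch of that rule (using std::string for brevity; the real
function must also quote reserved CQL keywords and a few more edge cases):

    #include <string>

    // quote unless the name is a plain lowercase identifier
    std::string maybe_quote(const std::string& name) {
        bool plain = !name.empty();
        for (char c : name) {
            if (!((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                    || c == '_')) {
                plain = false;
                break;
            }
        }
        if (plain) {
            return name;
        }
        std::string quoted = "\"";
        for (char c : name) {
            quoted += c;
            if (c == '"') {
                quoted += c;   // a quote inside a quoted name is doubled
            }
        }
        return quoted + "\"";
    }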
Fixes #3400.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-2-nyh@scylladb.com>
In commit d674b6f672, I fixed a case-sensitive column name bug by
avoiding CQL quoting of a column name
in create_index_statement.cc when building a "targets" option string.
However, there is also matching code in target_parser.hh to unquote
that option string. So this unquoting code is no longer necessary, and
should be dropped.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-1-nyh@scylladb.com>
Different request types have different timeouts (for example,
read requests have shorter timeouts than truncate requests), and
also different request sources have different timeouts (for example,
an internal local query wants infinite timeout while a user query
has a user-defined timeout).
To allow for this, define two types: timeout_config represents the
timeout configuration for a source (e.g. user), while
timeout_config_selector represents the request type, and is used
to select a timeout within a timeout configuration. The latter is
implemented as a pointer-to-member.
Also introduce an infinite timeout configuration for internal
queries.
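A sketch of the shape of these two types (member names illustrative; the
actual config has more entries and uses the database's timeout clock):

    #include <chrono>

    struct timeout_config {
        std::chrono::milliseconds read_timeout;
        std::chrono::milliseconds write_timeout;
        std::chrono::milliseconds truncate_timeout;
        // ...
    };

    // selects *which* timeout a statement needs, independent of the
    // configuration it will eventually be applied to
    using timeout_config_selector =
        std::chrono::milliseconds timeout_config::*;

    // determined once, at prepare time, e.g. for a read statement:
    constexpr timeout_config_selector stmt_timeout =
        &timeout_config::read_timeout;

    // resolved at execution time against the caller's configuration
    // (user config, or an "infinite" config for internal queries)
    inline std::chrono::milliseconds resolve(const timeout_config& cfg,
                                             timeout_config_selector sel) {
        return cfg.*sel;
    }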
In the current code, if the base table has a compound partition key (i.e.,
multiple partition-key columns), searching its secondary indexes didn't work.
There was no real reason for this; it was just a bug in preparing the
second query:
Every SI query is converted to two queries. The first queries the associated
materialized view, to find a list of primary keys. Those we need to use in a
second query, of the base table. The second query needs to list, as
restrictions, the keys found above. When a partition key is compound, its
components build one key and one restriction. But in the buggy code, we
incorrectly used each component as a separate (improperly formatted) key
and restriction, and obviously this didn't work.
This patch also adds a test that reproduces this problem and confirms its fix.
In the fixed code I also found another incorrect use of to_cql_string() (which
could break case-sensitive primary key column names) and changed it to
to_string().
Fixes #3210.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429124138.24406-1-nyh@scylladb.com>
We make multiple attempts to mark a node as alive. We do that by
sending an EchoMessage, and marking the node as alive upon receiving a
successful answer. In case there's a network partition and the nodes
can't reach each other, multiple messages may be delivered and
processed.
We can avoid processing duplicate EchoMessage replies by checking
whether we had already marked the node as alive.
Fixes #1184
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180428191942.31990-1-duarte@scylladb.com>
"
Recently many changes have landed in seastar for the I/O Scheduler. We
can now describe the I/O storage of a machine by its visible properties
like throughput and bandwidth instead of relying on an indirect
calculation.
For the instances we support, we can just measure that and start using
them right away.
A version of iotune that computes those properties is not yet ready, but
while writing it I noticed that we aren't really setting the nomerges
and scheduler properties of the disks under testing. We definitely
should, since that can influence the results. So this patchset also
starts doing that.
The commandline for iotunev2 shouldn't change much. When it is ready we
will just adjust this script once more.
"
* 'scylla_io_setup' of github.com:glommer/scylla:
scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
scylla_lib: drop support for m3 and c3 AWS instance types
io_setup: call blocktune before tuning I/O
blocktune: allow it to be called as a library.
scripts: move scylla-blocktune to scripts location
* seastar 70aecca...ac02df7 (5):
> Merge "Prefix preprocessor definitions" from Jesse
> cmake: Do not enable warnings transitively
> posix: prevent unused variable warning
> build: Adjust DPDK options to fix compilation
> io_scheduler: adjust property names
DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references are now prefixed with SEASTAR_. Some may need to become
Scylla macros.
An iterator was incorrectly dereferenced when the timestamp resolution
was not explicitly specified.
The following dtests are fixed:
compaction_additional_test.CompactionAdditionalStrategyTests_with_TimeWindowCompactionStrategy.compaction_is_started_on_boot_test
compaction_additional_test.CompactionAdditionalTest.compact_data_by_time_window_test
compaction_additional_test.CompactionAdditionalTest.compaction_removes_ttld_data_by_time_windows_test
compaction_test.TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180427192545.17440-1-raphaelsc@scylladb.com>
We can use iotunev2 (or any other I/O generator) to test for the limits
of the disks for the i2 and i3 instance classes. The values here are
the ones I got from ~5 invocations of the (yet to be upstreamed)
iotune v2, with the IOPS numbers rounded for convenience of reading.
During the execution, I verified that the disks were saturated so we
can trust these numbers even if iotunev2 is merged in a different form.
The numbers are very consistent, unlike what we usually saw with the
first version of iotune.
Previously, we were just multiplying the concurrency number by the
number of disks. Now that we have better infrastructure, we will
manually test i3.large and i3.xlarge, since their disks are smaller
and slower.
For the other i3 instances, and all instances in the i2 family, storage
scales up
by adding more disks. So we can keep multiplying the characteristics of
one known disk by the number of disks and assuming perfect scaling.
Example for i3, obtained with i3.2xlarge:
read_iops = 411k
read_bandwidth = 1.9GB/s
So for i3.16xlarge, we would have read_iops = 3.28M and 15GB/s - very
close to the numbers advertised by AWS.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
m3 has 80GB SSDs in its largest form and I doubt anybody has ever
used it with Scylla.
I am also not aware of any c3 deployments. Since it is past generation,
it doesn't even show up in the default instance selector anymore.
I propose we drop AMI support for it. In practice, what that means is
that we won't auto-tune its I/O properties and people that want to use
it will have to run scylla_io_setup - like they do today with the EBS
instances.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are not configuring the disks the way we want them with respect to
scheduler and nomerges. This is an oversight that became clear now that
I have started rewriting iotune -- since I will explicitly test for that.
But since this can affect the results, it should have been here all along.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch makes the functions in scylla-blocktune available as a
library for other scripts - namely scylla_io_setup.
The filename, scylla-blocktune, is not the most convenient thing to call
from python, so instead of just wrapping it in the usual test for
__main__, I am just splitting the file into two.
Another option would be to patch all callers to call
scylla_blocktune.py, but because we are usually not using extensions in
scripts that are meant to be called directly I decided for the split.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
scylla-blocktune currently lives in the top level but this is mostly
historical. When time comes for us to install it, the packaging systems
will copy it to /usr/lib/scylla with the others.
So for consistency let's make sure that it also lives in the scripts
directory.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After upgrade from 1.7 to 2.0, nodes will record a per-table schema
version which matches that on 1.7 to support the rolling upgrade. Any
later schema change (after the upgrade is done) will drop this record
from affected tables so that the per-table schema version is
recalculated. If nodes perform a schema pull (they detect schema
mismatch), then the merge will affect all tables and will wipe the
per-table schema version record from all tables, even if their schema
did not change. If then only some nodes get restarted, the restarted
nodes will load tables with the new (recalculated) per-table schema
version, while not restarted nodes will still use the 1.7 per-table
schema version. Until all nodes are restarted, writes or reads between
nodes from different groups will involve a needless exchange of schema
definition.
This will manifest in logs with repeated messages indicating schema
merge with no effect, triggered by writes:
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
The sync will be performed if the receiving shard forgets the foreign
version, which happens if it doesn't process any request referencing
it for more than 1 second.
This may impact latency of writes and reads.
The fix is to treat schema changes which drop the 1.7 per-table schema
version marker as an alter, which will switch in-memory data
structures to use the new per-table schema version immediately,
without the need for a restart.
Fixes #3394
Tests:
- dtest: schema_test.py, schema_management_test.py
- reproduced and validated the fix with run_upgrade_tests.sh from git@github.com:tgrabiec/scylla-dtest.git
- unit (release)
Message-Id: <1524764211-12868-1-git-send-email-tgrabiec@scylladb.com>
"
This patch series introduces initial support for writing SSTables in
'mc' format (aka SSTables 3.0).
Currently, the following components are written in 3.0 format:
- Data.db
- Index.db
- Summary.db
(there were no changes to the summary file format compared to ka/la)
Other SSTables components are written in the old format for now as they
still need to exist to satisfy post-flush processing.
For now, only rows are written to the data file and indexed. Range
tombstones are not supported.
Writing rows is supported in full with the only exception being counter
cells. All the other features (TTLed data, row/cell level tombstones,
collections, etc) are supported.
Unit tests rely on producing files and binary-comparing them with
'golden' copies that are produced using Cassandra 3.11. This is done so
as not to block until reading the SSTables 3.0 format is implemented.
=======================================
Implementation notes
=======================================
Internally, sstable_writer has been refactored to support multiple
implementations that are instantiated in its constructor based on the
sstable version. Little to no code is shared among sstable_writer_v2 and
sstable_writer_v3 as we only intend to support sstable_writer_v2
alongside sstable_writer_v3 for a single release (to be able to do
rollback on rolling upgrade failure) and then plan to get rid of it
entirely and switch to always writing SSTables in the new format.
The design of sstable_writer_v3 mostly follows that of its precursors
sstable_writer(_v2) and components_writer. Some refactoring and further
code rearrangements are expected in the future but the main code is
there.
"
* 'projects/sstables-30/write-rows/v2' of https://github.com/argenet/scylla:
Add tests for writing data and index files in SSTables 3.0 ('mc') format.
Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
Add missing enum values to bound_kind.
Add building blocks for writing data in SSTables 3.0 format.
Refactor sstable_writer to support various internal implementations.
Add is_fixed_length() to data types.
Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
bound_kind::clustering, bound_kind::excl_end_incl_start and
bound_kind::incl_end_excl_start are used during SSTables 3.0 writing.
bound_kind::static_clustering is not used yet but added for completeness
and parity with the Origin.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For any given CQL data type, this member returns whether its values are
of fixed or variable length. This is used by SSTables 3.0 format to only
store the length value for variable-length cells.
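A sketch of how a writer consumes this flag (hypothetical writer API, not
the actual sstable code):

    // fixed-length values (e.g. int, uuid) are written bare; variable-
    // length ones (e.g. text, blob) are prefixed with their size
    void write_cell_value(file_writer& out, const abstract_type& type,
                          bytes_view value) {
        if (!type.is_fixed_length()) {
            write_unsigned_vint(out, value.size()); // length only if needed
        }
        out.write(value.data(), value.size());
    }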
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
This patchset prepares everything for support of both 2.x and 3.x
formats and implements reading, from an sstable in 3.x format, of a
very simple table with just partition keys.
Tests: units (release)
"
* 'haaawk/sstables3/read_only_partitions_v4' of ssh://github.com/scylladb/seastar-dev: (22 commits)
Test for reading sstable in MC format with no columns
Use new mp_row_consumer_m and data_consume_rows_context_m
Introduce mp_row_consumer_m
Rename mp_row_consumer to mp_row_consumer_k_l
Introduce consumer_m and data_consume_rows_context_m
Use read_short_length_bytes in RANGE_TOMBSTONE
Use read_short_length_bytes in ATOM_START
Use read_short_length_bytes in ROW_START
Add continuous_data_consumer::read_short_length_bytes
Reduce duplication with continuous_data_consumer::read_partial_int
Add test for a simple table with just partition key
Add test for reading index
Extract mp_row_consumer to separate header
Make sstable_mutation_reader independent from mp_row_consumer
Make sstable_mutation_reader a template
Make data_consume_context a template
Move data_consume_rows_context from row.cc to row.hh
Decouple sstable.hh and row.hh
Reduce visibility of sstable::data_consume_*
Move data_consume_context to separate header
...
Take DataConsumeRowsContext type as parameter.
This will allow us to implement a different context
for reading 3.x files.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Parametrize it with the type of data consume rows context.
There will be different implementations used for different
sstable file formats.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It will be used as a template parameter for sstable_mutation_reader
once it's turned into a template. This means the definition has
to be accessible.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
They are used just in partition.cc, row.cc and sstables_test.cc
so it is useful to cut their scope by moving them
to data_consume_context.hh.
This will make it much easier to turn data_consume_context into
a template.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It's used only in row.cc, partition.cc and sstables_test.cc
so it's better to reduce the dependency just to those files.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
On some build environments we may want to limit the number of parallel
jobs: ninja-build runs ncpus jobs by default, which may be too many
since g++ consumes a huge amount of memory.
So support --jobs <njobs>, just like the rpm build script does.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
"
This patchset brings in a statistics collector that tracks minimal
values for timestamps, TTLs and local deletion times for all the updates
made to a given memtable.
These statistics are later used when flushing memtables into SSTables
using 3.x ('mc') format to delta-encode corresponding values using
collected minimums as bases (that is why it is called encoding
statistics).
This patchset is sent out separately from other changes that introduce
writing SSTables 3.x, to facilitate the read path implementation that also
needs the encoding_stats structure.
The tests for the write path implicitly cover this functionality, as any
rows written to an SSTable 3.0 file make use of delta-encoding.
"
* 'projects/sstables-30/collect-encoding-statistics-v4' of https://github.com/argenet/scylla:
Collect encoding statistics for memtable updates.
Factor out min_tracker and max_tracker as common helpers.
Always pass mutation_partitions to partition_entry::apply()
We keep track of all updates and store the minimal values of timestamps,
TTLs and local deletion times across all the inserted data.
These values are written as a part of serialization_header for
Statistics.db and used for delta-encoding values when writing Data.db
file in SSTables 3.0 (mc) format.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
They will be re-used for collecting encoding statistics which is needed
to write SSTables 3.0.
Part of #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Previously it was also possible to pass a frozen_mutation to it.
Now we de-serialize frozen mutations at the calling side.
This is a pre-requisite for collecting memtable statistics needed for
writing into the SSTables 3.0 format.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When provisioning a Scylla docker image with --developer-mode 0 (disabled)
scylla_raid_setup is not invoked. As a consequence the "data" directory is not
created and scylla_io_setup fails (steps to reproduce and error message provided
at the end).
This patch adds the same verifications present in scylla_io_setup to docker's
scyllasetup.py and creates the data directory in the case it is not present.
--
Steps to reproduce on AWS i3.2xlarge with Ubuntu 16.04:
sudo -s
apt update && apt upgrade -y && apt-get install docker.io -y
mdadm --create --verbose --force --run /dev/md0 --level=0 -c1024 --raid-devices=1 /dev/nvme0n1
mkfs.xfs /dev/md0 -f -K
mkdir /var/lib/scylla
mount -t xfs /dev/md0 /var/lib/scylla
docker run --name some-scylla \
--volume /var/lib/scylla:/var/lib/scylla \
-p 9042:9042 -p 7000:7000 -p 7001:7001 -p 7199:7199 \
-p 9160:9160 -p 9180:9180 -p 10000:10000 \
-d scylladb/scylla --overprovisioned 1 --developer-mode 0
docker logs some-scylla
running: (['/usr/lib/scylla/scylla_dev_mode_setup', '--developer-mode', '0'],)
running: (['/usr/lib/scylla/scylla_io_setup'],)
terminate called after throwing an instance of 'std::system_error'
what(): open: No such file or directory
ERROR:root:/var/lib/scylla/data did not pass validation tests, it may not be on XFS and/or has limited disk space.
This is a non-supported setup, and performance is expected to be very bad.
For better performance, placing your data on XFS-formatted directories is required.
To override this error, enable developer mode as follow:
sudo /usr/lib/scylla/scylla_dev_mode_setup --developer-mode 1
failed!
Traceback (most recent call last):
File "/docker-entrypoint.py", line 15, in <module>
setup.io()
File "/scyllasetup.py", line 34, in io
self._run(['/usr/lib/scylla/scylla_io_setup'])
File "/scyllasetup.py", line 23, in _run
subprocess.check_call(*args, **kwargs)
File "/usr/lib64/python3.4/subprocess.py", line 558, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/lib/scylla/scylla_io_setup']' returned non-zero exit status 1
ls -latr /var/lib/scylla
total 4
drwxr-xr-x 44 root root 4096 Abr 24 13:02 ..
drwxr-xr-x 2 root root 6 Abr 24 13:10 .
Signed-off-by: Moreno Garcia <moreno@scylladb.com>
Message-Id: <20180424173729.22151-1-moreno@scylladb.com>
Fixes #3187
Requires seastar "inet_address: Add constructor and conversion function
from/to IPv4"
Implements IPv6 support for the CQL inet data type. The actual data
stored will now vary between 4 and 16 bytes. gms::inet_address has been
augmented to interop with seastar::inet_address, though of course
actually trying to use an IPv6 address there or in any of its tables
will throw badly.
Tests assuming IPv4 were changed. Storing an ipv4_address should be
transparent, as it now "widens". However, since every ipv4 is an
inet_address, but not vice versa, there is no implicit overloading on
the read paths. I.e. tests and system_keyspace (where we read IP
addresses from tables explicitly) are modified to use the proper type.
Message-Id: <20180424161817.26316-1-calle@scylladb.com>
CQL normally folds identifiers such as column names to lowercase. However,
if the column name is quoted, case-sensitive column names and other strange
characters can be used. We had a bug where such columns could be indexed,
but then, when trying to use the index in a SELECT statement, it was not
found.
The existing code remembered the index's column after converting it to CQL
format (adding quotes). But such conversion was unnecessary, and wrong,
because the rest of the code works with bare strings and does not involve
actual CQL statements. So the fix avoids this mistaken conversion.
This patch also includes a test to reproduce this problem.
Fixes #3154.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424154920.15924-1-nyh@scylladb.com>
This commit fixes two closely related issues with handling
case-sensitive column names in JSON:
* according to doc, case-sensitive names should be wrapped with
additional pair of double quotes during JSON SELECT
* logic error in parse_json() prevented INSERT JSON from working
properly on case-sensitive column names
This commit is followed by updated cql_query_test, which checks
case-sensitive cases as well.
Message-Id: <82d9d5e193a656e99bc86b297c00662a6fb808a0.1524576066.git.sarna@scylladb.com>
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now, the serialization header from the 3.x format is ignored.
Tests: units (release)
"
* 'haaawk/sstables3/loading_v4' of ssh://github.com/scylladb/seastar-dev:
Add test for loading the whole sstable
Add test for loading statistics
Add support for 3_x stats metadata
Pass sstable version to describe_type
Pass sstable version to write methods
metadata_type: add Serialization type
Pass sstable_version_types to parse methods
Add test for reading filter
Add test for read_summary
sstables 3.x: Add test for reading TOC
sstable: Make component_map version dependent
sstable::component_type: add operator<<
Extract sstable::component_type to separate header
Remove unused sstable::get_shared_components
sstable_version_types: add mc version
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
I forgot that I also need to update test.py for the new test.
It's unfortunate that this script doesn't pick up the list of
tests automatically (perhaps with a black-list of tests we don't
want to run). I wonder if there are additional tests we are
forgetting to run.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424085911.29732-1-nyh@scylladb.com>
Move the two tests we have for the secondary indexing feature from the
huge tests/cql_query_test.cc to a new file, secondary_index_test.cc.
Having these tests in a separate file will make it easier and faster to
write more tests for this feature, and to run these tests together.
This patch doesn't change anything in the tests' code - it's just a code
move.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424084700.28816-1-nyh@scylladb.com>
Old versions of JsonCpp declare the following typedefs for internally
used aliases:
typedef long long int Int64;
typedef unsigned long long int UInt64;
In newer versions (1.8.x), those are declared as:
typedef int64_t Int64;
typedef uint64_t UInt64;
Those base types are not identical so in cases when a type has
constructors overloaded only for specific integral types (such as
Json::Value in JsonCpp or data_value in Scylla), an attempt to
pack/unpack an integer from/to a JSON object causes ambiguous calls.
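A reduced illustration (not the actual JsonCpp or Scylla declarations):
on an LP64 platform int64_t is long, so a long long argument now converts
equally well to more than one overload.

    #include <cstdint>

    struct value {                  // stand-in for Json::Value / data_value
        value(int32_t) {}
        value(int64_t) {}           // 'long' on LP64 once Int64 is int64_t
    };

    void pack(long long x) {
        // value v(x);              // ambiguous: both constructors need an
        //                          // integral conversion of the same rank
        value v(int64_t(x));        // an explicit cast resolves the call
    }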
Fixes #3208
Tests: unit {release}.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e9fff9f41e0f34b15afc90b5439be03e4295623e.1524556258.git.vladimir@scylladb.com>
We were feeding the total estimated partition count of an input shared
sstable to the output unshared ones.
So the sstable writer thinks, *from estimation*, that each sstable created
by resharding will have the same data amount as the shared sstable it
is being created from. That's a problem because the estimation is fed to
bloom filter creation, which directly influences the filter's size.
So if we're resharding all sstables that belong to all shards, the
disk usage taken by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.
Partition count estimation for a shard S will now be done as follows:
//
// TE, the total estimated partition count for a shard S, is defined as
// TE = Sum(i = 0...N) { Ei / Si }.
//
// where i is an input sstable that belongs to shard S,
// Ei is the estimated partition count for sstable i,
// Si is the total number of shards that own sstable i.
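A direct transcription of that formula (a sketch, not the actual
resharding code):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // each input sstable i contributes Ei / Si to shard S's estimate
    uint64_t estimated_partitions_for_shard(
            const std::vector<std::pair<uint64_t, unsigned>>& inputs) {
        uint64_t te = 0;
        for (auto& [ei, si] : inputs) {   // (Ei, Si) per input sstable
            te += ei / si;
        }
        return te;
    }
    // e.g. two inputs {E=1000, S=4} and {E=2000, S=2}
    // give 250 + 1000 = 1250 partitions for shard S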
Fixes #2672.
Refs #3302.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
"
Fixes to several issues around view update generation, pertaining to
timestamp and TTL management.
Fixes #3361. Fixes #3360. Fixes #3140.
Refs #3362
Tests: unit(release, debug), dtest(materialized_views.py)
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/fixes-galore/v2' of http://github.com/duarten/scylla:
mutation_partition: Clarify comment about emptiness
tests: Add view_complex_test
tests/view_schema_test: Complete test
db/view: Move cells instead of copying in add_cells_to_view()
db/view: Handle unselected base columns and corner cases
mutation_partition: Regular base column in view determines row liveness
db/view: Don't avoid read-before-write when view PK matches base
db/view: Process base updates to column unselected by its views
db/view: Consider partition tombstone when generating updates
tests/view_schema_test: Remove unneeded test
mutation_fragment: Allow querying if row is live
view_info: Add view_column() overload
view_info: Explicitly initialize base-dependent fields
cql3/alter_table_statement: Forbid dropping columns of MV base tables
This patch fixes several cases where it was disallowed to create
a materialized view with a filter ("where ..."), for no good reason.
After this patch, these cases will be allowed. Fixes #2367.
In ordinary SELECT queries, certain types of filtering which is known to
be deceptively inefficient is now allowed. For example, trying to query
a range of partition keys cannot be done without reading the entire
database (because the murmur3 tokenizer randomizes the order of partitions).
Restricting two partition key components also cannot be done without
reading excessive amount of the entire partition. So Scylla, following
Cassandra, chooses to disallow such SELECT queries, and give an error
message.
However, the same SELECT statements *should* be allowed when defining a
materialized view. In this case, the filter is just used to check an
individual row - not to search for one - so there is no performance
concern.
Unfortunately the existing code did these validations while building the
SELECT statement's "restrictions", in code shared by both uses of SELECT
(query and MV definition). It was easy to move one of the validations
to later code which runs after the restriction has already been built (and
knows if it is working for query or MV), but because of the way the
"restrictions" objects (translated from Cassandra 2's code) hide what they
contain, many of the checks are harder to perform after having built the
restrictions object. So instead, we add in strategic places in the
restriction-handling code a new "allow_filtering" flag. If restrictions
are built with allow_filtering=true, the extra performance-oriented tests
on the filtering restrictions are not done. Materialized views set
allow_filtering=true.
The allow_filtering flag will also be useful later when we want to support
the "ALLOW FILTERING" query option which is currently not supported properly
(we have several open issues on that). However note that this patch doesn't
complete that support: I left a FIXME in the spot where we set
allow_filtering in the Materialized Views case, but in the future we also
need to set it if the user specified "ALLOW FILTERING" in the query.
This patch also enables several unit tests written by Duarte which used to
fail because of this bug, and now pass. These tests verify that the
restrictions are now allowed and filter the view as desired; but I also
added test code to verify that the same restrictions are still forbidden,
as before, when used in ordinary SELECT queries.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180423124343.17591-1-nyh@scylladb.com>
"
Make sure install_dependencies.sh installs all the right dependencies
and that the example `configure.py` invocation can just be copy-pasted
into the terminal and will "just work".
Ref: #3208
"
* 'fix_centos_compile/v2' of https://github.com/denesb/scylla:
install_dependencies.sh: update centos package list and example
configure.py: add --with-ragel option
configure.py: add --with-antlr3
configure.py: check compiler version first
Add missing packages to `yum install` list:
* scylla-boost163-static
* scylla-python34-pyparsing20
Update the configure.py example so that it just works:
* Change g++ to 7.3
* Add --with-antlr3 pointing to antlr3 installed from scylla 3rdparty
Before checking anything else (presence of boost, its version, etc.)
check that the compiler is present and can compile and link a simple c++
program.
Previously, if the compiler was not set up correctly, configure.py would
fail at one of the other try_compile checks, whichever came first (usually
the one checking for boost). This led the user into chasing some
false-positive error when in fact the compiler wasn't working.
Debian 8 produces "Invalid argument" when we use AmbientCapabilities in the
systemd unit file, so drop the line when we build the .deb package for
Debian 8.
For other distributions, keep using the feature.
Fixes#3344
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180423102041.2138-1-syuu@scylladb.com>
"
This series complements JSON support with INSERT JSON and fromJson
cql function.
The INSERT JSON implementation tries hard to interfere as little as possible
with the regular INSERT path. So, after being parsed, insertJsonStatement
exists as a separate statement and is handled in a special way.
Overridden add_update_for_key extracts values from JSON map and applies
them to columns.
Converting from insert_json_statement to insert_statement uses auxiliary
from_json_object methods to convert JSON-encoded types to bytes.
Then, terms are matched to appropriate column names and cells are
updated.
fromJson CQL function uses the same from_json_object helper methods,
but applies them to single arguments, not whole rows.
Existing json handling functions from json.hh and libjsoncpp were used
where possible.
Things implemented:
* expanding CQL grammar to accept INSERT JSON
* converting JSON representation of cql values to cql terms
* serving 'INSERT INTO xxx JSON yyy' clause
* tests for INSERT JSON and fromJson()
"
* 'json_ops_2' of https://github.com/psarna/scylla:
tests: add cql unit tests for INSERT JSON
cql3: add fromJson() function
cql3: add INSERT JSON parsing to CQL grammar
cql3: add support for INSERT JSON clause
cql3: decouple execute from term binding in setters
cql3: change operation::make_* functions to static
cql3: add from_json_object function to types
cql3: Make literals::NULL_VALUE public
This commit adds tests for INSERT JSON clause, which is expected
to accept JSON strings and insert appropriate values to columns
defined there.
The tests also cover fromJson function calls and inserting prepared
batch statements with INSERT JSON inside.
References #2058
This commit extends JSON support with the fromJson() function,
which can be used in an UPDATE clause to transform a JSON value
into a value with the proper CQL type.
fromJson() accepts strings and may return any type, so its instances,
like toJson(), are generated during calls.
This commit also extends functions::get() with additional
'receiver' parameter. This parameter is used to extract receiver type
information needed to generate a proper fromJson instance.
Receiver is known only during insert/update, so functions::get() also
accepts a nullptr if receiver is not known (e.g. during selection).
References #2058
This commit adds the implementation of INSERT JSON clause
which accepts JSON object as parameter and inserts appropriate
values into appropriate columns, as defined in given JSON.
Example:
INSERT INTO testme JSON '{
"id" : 77,
"name" : "Jones",
"ranking" : 8.5
}'
References #2058
This commit makes it possible to pass values to setters,
instead of having to pass cql3::term instances.
Thanks to that, previously prepared terminals can be directly
used in setter execution.
References #2058
This commit makes operation::make* functions static, because they
don't access any instance-specific data anyway. It is later needed
to decouple setter execution from binding a cql3::term.
This commit adds a 'from_json_object' method which will be used
for converting JSON representation of a value to raw bytes representing
the same value. This functionality will be needed by 'INSERT JSON'
clause implementation, which can turn these raw bytes into cql3::term.
References #2058
continuous_data_consumer_test takes an unreasonable amount of
time to run, especially in debug mode. Reduce the run time by
reducing the number of loops.
Message-Id: <20180422150938.29143-1-avi@scylladb.com>
This patch introduces view_complex_test and adds more test coverage
for materialized views.
A new file was introduced to avoid making view_schema_test slower.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.
This patch ensures that unselected columns are considered as much as
possible, even though some limitations will still exist. In
particular, we need to represent multiple timestamps (from all the
unselected columns), but only have mechanisms to record a single
timestamp.
We also have some issues when dealing with selected columns, and the
way we currently delete them. Consider the following:
create table cf (p int, c int, a int, b int, primary key (p, c))
create materialized view vcf as select a, b
from cf where p is not null and c is not null
primary key (p, c)
1) update cf using timestamp 10 set a = 1 where p = 1 and c = 1
2) delete a from cf using timestamp 11 where p = 1 and c = 1
3) update cf using timestamp 1 set a = 2 where p = 1 and c = 1
After 1), the MV should include a row with row marker @ ts10,
p = 1, c = 1, a = 1. After 2), this row should be removed.
At 3), we should add a row with row marker @ ts1, p = 1, c = 1, a = 2,
with a lower timestamp. This means that the delete should not
insert a row tombstone with timestamp @ 11, as we do now, but should
just delete the view's row marker (which exists) with ts1.
Refs #3362. Fixes #3140. Fixes #3361.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When views contain a primary key column that is not part of the base
table primary key, that column determines whether the row is live or
not. We need to ensure that when that cell is dead (and thus the
derived row marker), whether by normal deletion or by TTL, so is the
rest of the row.
This patch introduces the idea of a shadowing row marker. We map the
status of the regular base column in the view's PK to the view row's
marker. If this marker is dead, so is that cell in the base table, and
so should the view row become. To enforce that, a view row's dead
marker shadows the whole row if that view includes a base regular
column in its PK.
Fixes #3360
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. When calculating the view's row marker we need
to access those unselected columns, so we can't avoid the
read-before-write as we were doing.
Refs #3362
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. So, process base updates to columns unselected by
any of its views.
Refs #3362
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Not adding the partition tombstone to the current list of tombstones
may cause updates to be incorrectly generated.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of lazily-initializing the regular base column in the view's
PK field, explicitly initialize it. This will be used by future
patches that don't have access to the schema when wanting to obtain
that column.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.
The fact that unselected columns can keep a view row alive also
requires that users cannot drop columns of base tables with
materialized views, which this patch implements.
Refs #3362
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.
Tests: units (release)
"
* 'haaawk/sstables3/loading_v3' of ssh://github.com/scylladb/seastar-dev:
Add test for loading the whole sstable
Add test for loading statistics
Add support for 3_x stats metadata
Pass sstable version to describe_type
Pass sstable version to write methods
metadata_type: add Serialization type
Pass sstable_version_types to parse methods
Add test for reading filter
Add test for read_summary
sstables 3.x: Add test for reading TOC
sstable: Make component_map version dependent
sstable::component_type: add operator<<
Extract sstable::component_type to separate header
Remove unused sstable::get_shared_components
sstable_version_types: add mc version
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The compression_parameter constructor is called with an extra level of
parentheses. Presumably this caused a temporary object to be constructed
and then moved into the argument being initialized, but gcc 8 complains
about ambiguity.
Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121854.12314-1-avi@scylladb.com>
The token constructor is called with an extra level of parentheses. Presumably
this caused a temporary object to be constructed and then moved into the
variable being initialized, but gcc 8 complains about ambiguity.
Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121736.12136-1-avi@scylladb.com>
The parameters to the MutationFragmentConsumer concept must be concrete
types, not decltype(auto).
Reported by gcc 8.
Message-Id: <20180421110738.7574-1-avi@scylladb.com>
"
Prints info about sstables used by readers
Example:
(gdb) scylla active-sstables
sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
sstable_count=2, total_index_lists_size=0
"
* 'tgrabiec/gdb-scylla-active-sstables' of github.com:tgrabiec/scylla:
gdb: Introduce "scylla active-sstables" command
gdb: Make list_unordered_map() more general
gdb: Improve compatibility with python2.7
Prints info about sstables used by readers
Example:
(gdb) scylla active-sstables
sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
sstable_count=2, total_index_lists_size=0
1) vt.name returns None for some types, use str() instead
2) some unordered_maps use 'false' as the second Hash_node template parameter
3) some consumers will prefer a reference to the value instead of its address
"
Enhance continuous_data_consumer to use the existing vint serialization for
reading variable-length integers from SSTables.
Also available at:
https://github.com/scylladb/seastar-dev/commits/haaawk/sstables3/unsigned-vint-v6
Tests: units (release)
"
* 'haaawk/sstables3/unsigned-vint-v6' of ssh://github.com/scylladb/seastar-dev:
sstables: add test for continuous_data_consumer::read_unsigned_vint
buffer_input_stream: make it possible to specify chunk size
Add tests for make_limiting_data_source
Introduce make_limiting_data_source
sstables: add continuous_data_consumer::read_unsigned_vint
Cover serialized_size_from_first_byte in tests
core: add unsigned_vint::serialized_size_from_first_byte
sstables: add all dependant headers to consumer.hh
sstables: add all dependant headers to exceptions.hh
core: add #pragma once to vint-serialization.hh
When 'always_set_home' is specified in /etc/sudoers, pbuilder won't read
.pbuilderrc from the current user's home directory, and we don't have a way
to change that behavior via a sudo command parameter.
So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo is
executed; this works both in environments which specify always_set_home
and in those which don't.
Fixes#3366
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
This method takes a data_source and returns another data_source
that returns data from the input source but in chunks of limited
size.
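A hedged sketch of such a wrapper over seastar's data_source_impl (the
actual implementation may differ):

    #include <algorithm>
    #include <memory>
    #include <seastar/core/iostream.hh>
    #include <seastar/core/temporary_buffer.hh>

    using namespace seastar;

    class limiting_data_source_impl : public data_source_impl {
        data_source _src;
        temporary_buffer<char> _pending;
        size_t _limit;
    public:
        limiting_data_source_impl(data_source src, size_t limit)
            : _src(std::move(src)), _limit(limit) {}
        future<temporary_buffer<char>> get() override {
            if (!_pending.empty()) {
                auto n = std::min(_limit, _pending.size());
                auto chunk = _pending.share(0, n); // hand out <= limit bytes
                _pending.trim_front(n);
                return make_ready_future<temporary_buffer<char>>(
                        std::move(chunk));
            }
            return _src.get().then([this] (temporary_buffer<char> buf) {
                if (buf.empty()) {                 // end-of-stream
                    return make_ready_future<temporary_buffer<char>>(
                            std::move(buf));
                }
                _pending = std::move(buf);
                return get();                      // serve from _pending
            });
        }
    };

    data_source make_limiting_data_source(data_source src, size_t limit) {
        return data_source(std::make_unique<limiting_data_source_impl>(
                std::move(src), limit));
    }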
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This method takes the first byte and determines how many bytes
are used to represent an unsigned variable-length integer.
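A sketch assuming Cassandra's vint rule, where the number of leading 1
bits in the first byte equals the number of extra bytes that follow
(9 bytes at most):

    #include <cstddef>
    #include <cstdint>

    size_t serialized_size_from_first_byte(uint8_t first) {
        if (first == 0xff) {
            return 9;              // all ones: 8 extra bytes follow
        }
        // leading ones of 'first' == leading zeros of its complement
        uint32_t inverted = uint32_t(uint8_t(~first)) << 24;
        return 1 + __builtin_clz(inverted);
    }
    // 0b0xxxxxxx -> 1 byte, 0b10xxxxxx -> 2 bytes, ..., 0xff -> 9 bytes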
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Before, it depended on byteorder.hh, which just happened
to be included in all compilation units that were using consumer.hh.
This change makes the header compile when used in new compilation units.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Before, it depended on print.hh, which just happened
to be included in all compilation units that were using
exceptions.hh. This change makes the header compile
when used in new compilation units.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Since storage_proxy provides access to the entire cluster, a local shard
reference is sufficient. Adjust query_processor to store a reference to
just the local shard, rather than a seastar::sharded<storage_proxy> and
adjust callers.
This simplifies the code a little.
Message-Id: <20180415142656.25370-3-avi@scylladb.com>
The storage_proxy represents the entire cluster, so there's never a need
to access it on a remote shard; the local shard instance will contact
remote shard or remote nodes as needed.
Simplify the API by passing storage_proxy references instead of
seastar::sharded<storage_proxy> references. query_processor and
other callers are adjusted to call seastar::sharded::local() first.
Message-Id: <20180415142656.25370-2-avi@scylladb.com>
build_deb.sh relies on pbuilder picking up a ~/.pbuilderrc which we
copy from the script. According to the pbuilder manual, "~" will refer
to the root directory (since pbuilder is run via sudo). In practice
we've observed this working with "~" referring to the current user's
home directory, but also sometimes failing, while complaining
about /root/.pbuilderrc failing. When it fails, it fails to set
the correct distribution.
To be extra sure, also copy .pbuilderrc to root's home directory. This
way, whatever behavior pbuilder chooses to follow, it will have a
configuration file to read.
Message-Id: <20180410134508.9415-1-avi@scylladb.com>
The patch fixes a bug introduced by commit
089b54f2d2.
When sstable files are stored in the .../upload directory
and refresh is initiated with `nodetool`, it fails
because Scylla doesn't expect .../upload to be a part of the path.
Fixes #3334.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180413132019.17779-1-daniel@scylladb.com>
This commit extends JSON support with the toJson() function,
which can be used in a SELECT clause to transform a single argument
to JSON form.
toJson() accepts any type including nested collection types,
so instead of being declared with concrete types,
proper toJson() instances are generated during calls.
This commit also supplements JSON CQL query tests with toJson calls.
Finally, it refactors JSON tests so they use do_with_cql_env_thread.
References #2058
Message-Id: <a7833650428e9ef590765a14e91c4d42532588f4.1523528698.git.sarna@scylladb.com>
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete, a stale pointer to the connection will remain in the
notification list, since the attempt to unregister the connection happens
too early.
The fix is to move notifier unregistration to after the connection's gate
is closed, which ensures that there is no outstanding registration
request. But this means that a connection with a closed gate can now be in
the notifier list, so with_gate() may throw and abort the notifier loop. Fix
that by replacing with_gate() with a call to is_closed().
Fixes: #3355
Tests: unit(release)
Message-Id: <20180412134744.GB22593@scylladb.com>
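A hedged sketch of the notifier loop after the fix (the connection list and gate accessor are invented names): with_gate() would throw gate_closed_exception for a closed connection and abort the loop, whereas checking is_closed() lets the loop simply skip it:
```
void notify_all() {
    for (auto& conn : _connections) {
        if (conn.gate().is_closed()) {
            continue; // shutting down; it will unregister itself shortly
        }
        notify(conn);
    }
}
```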
That's blocking KairosDB users because it uses TWCS with millisecond
timestamp resolution.
Also older drivers use millisecond instead of the default microsecond.
Fixes#3152.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180411171244.19958-1-raphaelsc@scylladb.com>
In my well-intentioned attempt to use fewer magic numbers in the loading
code I replaced "64" with something calculated automatically from the
type being used.
Except I did it wrong, because sizeof(uint64_t) is 8, not 64.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411155903.27665-1-glauber@scylladb.com>
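The fix in miniature; these lines are a generic restatement, not the actual patch:
```
#include <climits>
#include <cstdint>

static_assert(sizeof(uint64_t) == 8, "sizeof() counts bytes, not bits");
constexpr unsigned bits_per_word = sizeof(uint64_t) * CHAR_BIT; // == 64
```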
"
This series introduces 'SELECT JSON' clause support for CQL.
Things implemented:
* expanding CQL grammar with JSON keyword
* converting values to JSON format
* serving 'SELECT JSON *' clauses
* tests for 'SELECT JSON'
"
* 'json_ops' of https://github.com/psarna/scylla:
tests: add cql unit tests for SELECT JSON
cql3: Add JSON token to CQL grammar
cql3: add support for SELECT JSON clause
cql3: add to_json_string function to types
This commit adds JSON keyword to CQL grammar and allows parsing
'SELECT JSON' command in CQL. Additionally, it will be useful
in implementing 'INSERT JSON(...)'.
References #2058
This commit adds the implementation of SELECT JSON clause
which returns rows in JSON format. Each returned row has a single
'[json]' column.
References #2058
"
The multishard combined reader provides a convenient
flat_mutation_reader implementation that takes care of efficiently
reading a range from all shards that own data belonging to the range.
All this happens transparently, the user of the reader need only pass a
factory function to the multishard reader which it uses to create
remote readers when needed. These remote readers will then be managed
through foreign reader which abstracts away the fact that the reader is
located on a remote shard.
Sub readers are created for the entire read range, meaning they are free
to cross shard-range limits to fill their buffer. The output of these
sub readers is merged in a round-robin manner, the same way data is
distributed among shards. The multishard reader will move to the next
shard's reader whenever it encounters a partition whose token is after
the delimiter token.
To improve throughput and latency, two levels of read-ahead are employed.
One in foreign_reader, which will try to fill the remote shard reader's
buffer in the background, in parallel to processing the results on the
local shard. And one in the multishard reader itself which will
exponentially increase concurrency whenever a sub-reader's buffer
becomes empty. But only if this happened after crossing a shard
boundary. This is important because there is no point in increasing
concurrency if a single sub-reader can fill the multishard reader's
buffer.
"
* 'multishard-reader/v3' of https://github.com/denesb/scylla:
Add unit tests for multishard_combined_reader
Add multishard_combined_reader
flat_mutation_reader: add peek_buffer()
Add unit tests for foreign_reader
forwardable reader: implement fast_forward_to(position_in_partition)
Add foreign_reader
flat_mutation_reader: add detach_buffer()
Asias reported in issue #3351 that a floating point exception was seen
while loading SSTables. Looking at the trace, that seems to be because
we tried to issue a modulo operation with something that was likely 0.
That field comes from the nr_bits attribute in the large bitset, and our
current code should set it to whatever we read from the Filter file -
something that has been working for ages.
The difference is that after the patch that Asias identified as culprit,
we are moving the array from which we compute the size in the same
parameter list where we are computing the size.
This works for me and passed all my tests - likely because my compiler
was doing left-to-right evaluation as I would expect it to do. But the
standard doesn't guarantee that at all, and it reads:
"Order of evaluation of the operands of almost all C++ operators
(including the order of evaluation of function arguments in a
function-call expression and the order of evaluation of the
subexpressions within any expression) is unspecified. The compiler can
evaluate operands in any order, and may choose another order when the
same expression is evaluated again."
This likely fixes the bug, but even if it doesn't we should patch it,
since we currently have something that is technically UB.
Fixes#3351.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411144036.24748-1-glauber@scylladb.com>
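A hedged sketch of the hazard with invented names; the point is only that the two arguments may be evaluated in either order:
```
#include <cstdint>
#include <vector>

void init_bitset(std::vector<uint64_t> storage, size_t nr_bits);

void broken(std::vector<uint64_t> v) {
    // Unspecified order: v.size() may run after v has been moved from.
    init_bitset(std::move(v), v.size() * 64);
}

void fixed(std::vector<uint64_t> v) {
    auto nr_bits = v.size() * 64; // evaluate before the move
    init_bitset(std::move(v), nr_bits);
}
```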
The patch fixes a bug introduced by commit 089b54f2d2.
This bug manifested when master was deployed in an attempt to populate
materialised views. The nodes restarted in the middle and were not able
to come back.
The fix is to remember formats and versions of sstables for every generation.
Fixes: #3324.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180410083114.17315-1-daniel@scylladb.com>
This commit adds a 'to_json_string' method which will be used
for converting values to JSON strings. In several cases it's not
sufficient to use 'to_string', e.g. actual strings need to be
surrounded with double quotes.
References #2058
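A minimal sketch of why a dedicated method is needed; real code must also escape embedded quotes and control characters:
```
#include <string>

// Illustrative only: a text value rendered as JSON must be quoted,
// while to_string()-style output yields the bare characters.
std::string to_json_string(const std::string& s) {
    return "\"" + s + "\""; // plus escaping, omitted here
}
```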
Some tests escaped the --overprovisioned flag, causing them to
compete over cpu 0. Add the flag to all tests.
Message-Id: <20180410181606.8341-1-avi@scylladb.com>
Takes care of reading a range from all shards that own a subrange in the
range. The read happens sequentially, reading from one shard at a time.
Under the hood it uses combined_mutation_reader and foreign_reader,
the former providing the merging logic and the latter taking care of
transferring the output of the remote readers to the local shard.
Readers are created on-demand by a reader-selector implementation that
creates readers for yet unvisited shards as the read progresses.
The read starts with a concurrency of one, that is the reader reads from
a single shard at a time. The concurrency is exponentially increased (to
a maximum of the number of shards) when a reader's buffer is empty after
moving to the next shard. This condition is important as we only want to
increase concurrency for sparse tables that have little data and the
reader has to move between shards often. When concurrency is > 1, the
reader issues background read-aheads to the next shards so that by the
time it needs to move to them they have the data ready.
For dense tables (where we rarely cross shards) we rely on the
foreign_reader to issue sufficient read-aheads on its own to avoid
blocking.
Allows peeking at the next mutation fragment in the buffer. As opposed
to the existing `peek()` it assumes there's at least one fragment in the
buffer. Useful for code that already ensured that the buffer is not
empty and doesn't want to introduce a continuation (via `peek()`).
Instead of throwing std::bad_function_call. Needed by the foreign_reader
unit test. Not sure how other tests didn't hit this before as the test
is using `run_mutation_source_tests()`.
Local representative of a reader located on a remote shard. Manages the
lifecycle and takes care of seamlessly transferring fragments produced
by the remote reader. Fragments are *copied* between the shards in
batches, a bufferful at a time.
To maximize throughput read-ahead is used. After each fill_buffer() or
fast_forward_to() a read-ahead (a fill_buffer() on the remote reader) is
issued. This read-ahead runs in the background and is brought back to
foreground on the next fill_buffer() or fast_forward_to() call.
Allows for detaching the internal buffer of the reader. Enables
convenient transferring of buffered fragments in a single batch but
will force the reader to reallocate its buffer on the next
fill_buffer() call.
Introduced for foreign_reader which favours quick transferring of the
fragments between shards in a single batch, over minimizing allocations,
which can be amortized by background read-aheads.
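A hedged sketch of the operation, assuming the reader keeps its fragments in a circular_buffer; the body is illustrative:
```
// Hand the whole buffer to the caller in one batch; the next
// fill_buffer() starts from an empty, freshly allocated buffer.
seastar::circular_buffer<mutation_fragment> detach_buffer() {
    return std::exchange(_buffer, {}); // std::exchange from <utility>
}
```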
After the change to serialize compaction on compaction weight (eff62bc61e),
the LCS invariant may break because parallel compaction can start, which is
not currently supported for LCS.
The condition is that the weight is deregistered right before the last
sstable for a leveled compaction is sealed, so a new compaction may
meanwhile start for the same column family and promote an sstable to an
overlapping token range.
That leads to the strategy restoring the invariant when it finds the
overlap, which means wasted resources.
The fix removes a fast-path check which is now incorrect because we release
the weight early, and also fixes a check for ongoing compaction which
prevented compaction from starting for LCS whenever the weight tracker
was not empty.
Fixes#3279.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.
Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
Problem:
Start node 1 2 3
Shutdown node2
Shutdown node1 node3
Start node1 node3
Try to replace_address for node2
The replace operation fails with the error:
seastar - Exiting on unhandled exception: std::runtime_error
(Cannot replace_address node2 because it doesn't exist in gossip)
This is because after all nodes shutdown, the other nodes do not have the
tokens and host_id info of node2 until node2 boots up and talks to the cluster.
If node2 cannot boot up for whatever reason, currently the only way to
recover node2 is to `nodetool removenode` and bootstrap node2 again. This will
change tokens in the cluster and cause more data movement than just replacing
node2.
To fix, we add the tokens and host_id gossip application state in add_saved_endpoint
during boot up.
This is pretty safe because the generation for application state added by
add_saved_endpoint is zero; if node2 actually boots, other nodes will update
with node2's version.
Before:
```
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
    "addrs": "127.0.0.2",
    "generation": 0,
    "is_alive": false,
    "update_time": 1523344828953,
    "version": 0
}
```
Node 2 can not be replaced.
After:
```
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
    "addrs": "127.0.0.2",
    "application_state": [
        {
            "application_state": 12,
            "value": "31284090-2557-4036-9367-7bb4ef49c35a",
            "version": 2
        },
        {
            "application_state": 13,
            "value": "... a lot of tokens ...",
            "version": 1
        }
    ],
    "generation": 0,
    "is_alive": false,
    "update_time": 1523344828953,
    "version": 0
}
```
Node 2 can be replaced.
Tests: dtest/replace_address_test.py
Fixes: #3347
Message-Id: <117fd6649939e0505847335791be8d7a96e7d273.1523346805.git.asias@scylladb.com>
save and load functions for the large_bitset were introduced by Avi with
d590e327c0.
In that commit, Avi says:
"... providing iterator-based load() and save() methods. The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks."
The only user of this interface is SSTables. And it turns out we don't
really split the access like that. What we do instead is to create a
chunked vector and then pass its begin() method with position = 0 and let
it write everything.
The problem here is that this requires the chunked vector to be fully
initialized, not just reserved. If the bitmap is large enough, that in
itself can take a long time without yielding (up to 16ms seen in my setup).
We can simplify things considerably by moving the large_bitset to use a
chunked vector internally: it already uses a poor man's version of it
by allocating chunks internally (it predates the chunked_vector).
By doing that, we can turn save() into a simple copy operation, and do
away with load altogether by adding a new constructor that will just
copy an existing chunked_vector.
Fixes#3341
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180409234726.28219-1-glauber@scylladb.com>
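A hedged sketch of the resulting shape (simplified; the real class carries more state):
```
#include <cstdint>

class large_bitset {
    utils::chunked_vector<uint64_t> _storage;
public:
    // load() is replaced by adopting existing storage wholesale, and
    // save() degenerates to copying _storage out.
    explicit large_bitset(utils::chunked_vector<uint64_t> storage)
        : _storage(std::move(storage)) {}
    const utils::chunked_vector<uint64_t>& storage() const { return _storage; }
};
```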
"This patchset removes zones and replaces them with a simpler system. LSA tries
to allocate segments at higher addresses, so that we'll end up with the standard
allocator using lower addresses and LSA using higher addresses, allowing for easier
allocation from std."
* tag 'lsa-no-zones/v6' of https://github.com/avikivity/scylla:
tests: add logalloc_test for large contiguous allocations in a challenging environemnt
logalloc: limit std segment allocations in debug mode
logalloc: introduce prime_segment_pool()
logalloc: limit non-contiguous reclaims
logalloc: pre-allocate all memory as lsa on startup
tests: add random test for dynamic_bitset
dynamic_bitset: optimize for large sets
dynamic_bitset: get rid of resize()
dynamic_bitset: remove find_*_clear() variants
logalloc: reduce segment size to 128k
logalloc: get rid of the emergency reserve stack
logalloc: replace zones with segment-at-a-time alloc/free
Commit 1671d9c433 (not on any release branch)
accidentally bumped the idle memtable flush cpu shares to 100 (representing
10%), causing flushes to be too aggressive even when they shouldn't
consume much cpu.
Fixes#3243.
Message-Id: <20180408104601.9607-1-avi@scylladb.com>
* seastar 33d8f74...2da7d46 (4):
> http routes: Add parameters to path when adding alias
> future: compile-time optimize futurize<void>::apply()
> memory: remove unneeded union 'pla'
> queue: not_empty()/not_full() should throw when called after abort
This state does not read any data and is used only to perform an
action when finishing reading a primitive type.
According to comment on continuous_data_consumer::non_consuming
such states should be marked as non_consuming.
Tests: units (release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <55a5c9b76268b50312ecd044291f28dcd8179a22.1523005293.git.piotr@scylladb.com>
Test large std allocations in an environment that has seen many persistent
std allocations interspersed with lsa allocations, causing memory fragmentation.
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.
Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.
However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.
Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section). Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented. This results in the original allocation continuing to fail.
To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim. The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.
After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.
Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.
Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
They are no longer used, and cannot be efficiently implemented
for large bitsets using a summary vector approach without slowing
down the find_*_set() variants, which are used.
Also remove find_previous_set() for the same reason.
Reducing the segment size reduces the time needed to compact segments,
and increases the number of segments that can be compacted (and so
the probability of finding low-occupancy segments).
128k is the size of I/O buffers and of thread stacks, so we can't
go lower than that without more significant changes.
This patch replaces the zones mechanism with something simpler: a
single segment is moved from the standard allocator to lsa and vice
versa, at a time. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.
Moving just one segment at a time reduces the latency costs of
transferring memory between free and std.
"Together with the already merged patch, we reduce the object file
from 114MB to 81MB."
* tag 'api-diet-1/v1' of https://github.com/avikivity/scylla:
api: type-erase all-column_family map_reduce variant
api: simplify 6-argument map_reduce_cf() variant
* seastar 7328d17...33d8f74 (3):
> memory: switch to buddy allocation
> tls: Ensure we always pass through semaphores on shutdown
> memory: replace placement-new in unions with member construction
See scylladb/seastar#426.
After f59f423f3c, an sstable is loaded only on shards
that own it so as to reduce the sstable load overhead.
The problem is that an sstable may no longer be forwarded to a shard that
needs to be aware of its existence, which would result in that sstable
generation being
reallocated for a write request.
That would result in a failure as follow:
"SSTable write failed due to existence of TOC file for generation..."
This can be fixed by forwarding any sstable at load to all its owner shards
*and* the shard responsible for its generation, which is determined as
follows (see the sketch below):
s = generation % smp::count
Fixes#3273.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>
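The forwarding rule restated as code (helper name invented):
```
#include <seastar/core/smp.hh>

// Besides its owner shards, an sstable is also forwarded to the shard
// accountable for its generation.
seastar::shard_id shard_for_generation(int64_t generation) {
    return generation % seastar::smp::count;
}
```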
"
Fixes to the view building process, discovered from field experience.
Tests: dtest(materialized_view_tests.py, smp=2)
"
* 'views/view-build-fixes/v1' of https://github.com/duarten/scylla:
db/view: Start view building after schema agreement
db/system_keyspace: scylla_views_builds_in_progress writes are user mem
db/view: Require configuration option to enable view building
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is a base table regular column, a
base table clustering key column, or a base table partition key column
that is part of a composite key.
Fixes#3262
Refs CASSANDRA-14345
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
If a base table or view has been dropped in one node, but another
one hasn't yet learned about it, it starts the view build process
immediately on boot, possibly calculating unneeded view updates and
causing errors at the view replica, if that replica has already
processed the schema changes. We should thus wait for schema
agreement, even if the node is a seed.
Fixes#3328
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Treat writes to scylla_views_builds_in_progress as user memory, as the
number of writes is dependent on the amount of user data on views
(times the number of views, divided by the view building batch size).
Fixes#3325
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View building, enabled by default, can contain or expose issues that
prevent the node from starting. In those cases, it is necessary to
disable view building such that the node can be submitted to
maintenance operations.
Fixes#3329
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The 6-argument map_reduce_cf function is identical to the 5-argument
version, except that it performs an extra cast (by calling
the 6th argument's operator=()).
Simplify the code by calling the 5-argument version from the 6-argument
version.
Reduces binary size by ~10%.
map_reduce_cf() is called with varying template parameters which each
have to be compiled separately. Unifying the internals to use types based
on std::any reduced the object size by 15% (115MB->99MB) with presumably
a commensurate decrease in compile time.
A version that used "I" instead of "std::any" (and thus merged the
internals only for callers that used the same result type) delivered
a 10% decrease in object size. While std::any is less safe, in this
case it is completely encapsulated.
Message-Id: <20180402213732.432-1-avi@scylladb.com>
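A hedged sketch of the type-erasure idea, independent of the real api/ code; column_family and the registry are placeholders:
```
#include <any>
#include <functional>
#include <utility>
#include <vector>

struct column_family {};                 // placeholder type
std::vector<column_family> all_tables;   // placeholder registry

// Compiled once, shared by every result type.
std::any map_reduce_erased(std::function<std::any (column_family&)> map,
                           std::function<std::any (std::any, std::any)> reduce,
                           std::any acc) {
    for (auto& cf : all_tables) {
        acc = reduce(std::move(acc), map(cf));
    }
    return acc;
}

// Thin typed shim; each instantiation is now trivial.
template <typename T, typename Map, typename Reduce>
T map_reduce_cf(Map map, Reduce reduce, T initial) {
    auto res = map_reduce_erased(
        [map] (column_family& cf) { return std::any(map(cf)); },
        [reduce] (std::any a, std::any b) {
            return std::any(reduce(std::any_cast<T>(std::move(a)),
                                   std::any_cast<T>(std::move(b))));
        },
        std::any(std::move(initial)));
    return std::any_cast<T>(std::move(res));
}
```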
test_large_allocation attempts to allocate almost half of memory.
With a buddy allocator, even if more than half of memory is free,
and even if it is contiguous, it is unlikely to be available as a
single allocation because the allocator inserts boundaries at powers-
of-two addresses.
Relax the test by allocating smaller chunks (but still the same amount,
and still with challenging sizes); allocating half of memory contiguously
is not a goal.
Also use a vector instead of a deque, and reserve it, so we don't get
intervening non-lsa allocations. I'm not sure there's a problem there
but let's not depend on the allocation patterns.
Message-Id: <20180401150828.13921-1-avi@scylladb.com>
While building with -O1, I saw that the linker could not find
the vtable for named_value<log_level>. Rather than fixing up the
includes (and likely lengthening build time), fix by defining
the class as an extern template, preventing it from being
instantiated at the call site.
Message-Id: <20180401150235.13451-1-avi@scylladb.com>
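The mechanism in brief; named_value and log_level here are self-contained stand-ins for the real types in the options framework:
```
enum class log_level { info, warn };

template <typename T>
class named_value {
public:
    virtual ~named_value() = default; // the vtable the linker looks for
    T value{};
};

// In the header: suppress implicit instantiation at call sites, so they
// reference the vtable instead of emitting it per translation unit.
extern template class named_value<log_level>;

// In exactly one .cc file: the single explicit instantiation definition.
template class named_value<log_level>;
```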
-O1 complains that client_state::_remote_addr is not initialized
(and it is right). The call site is tracing, which likely won't be
invoked for internal queries, but still.
Message-Id: <20180401150410.13651-1-avi@scylladb.com>
Since bytes is used to encapsulate blobs, not strings, there's no
need for a NUL terminator. It will never be passed to a function
that expects a C string.
Message-Id: <20180401151009.14108-1-avi@scylladb.com>
* seastar a66cc34...7328d17 (5):
> sstring: add support for non-nul-terminated sstrings
> core/sharded: Make async_sharded_service dtor virtual
> reactor: pass naked pointer to submit_io
> Merge http: "Add alias support to the API" from Amnon
> systemwide_memory_barrier: use madvise(MADV_DONTNEED) instead of mprotect()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
By default, overprovisioned is not enabled on docker unless it is
explicitly set. I have come to believe that this is a mistake.
If the user is running alone in the machine, and there are no other
processes pinned anywhere - including interrupts - not running
overprovisioned is the best choice.
But everywhere else, it is not: even if a user runs 2 docker containers
in the same machine and statically partitions CPUs with --smp (but
without cpuset) the docker containers will pin themselves to the same
sets of CPU, as they are totally unaware of each other.
It is also very common, especially in some virtualized environments, for
interrupts not to be properly distributed - being particularly keen on
being delivered on CPU0, a CPU which Scylla will pin by default.
Lastly, environments like Kubernetes simply don't support pinning at the
moment.
This patch enables the overprovisioned flag if it is explicitly set -
like we did before - but also by default unless --cpuset is set.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180331142131.842-1-glauber@scylladb.com>
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be passed via
YAML, for Cassandra compatibility.
That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.
While this problem is common to all Unused options, the likelihood for
any other unused option to appear in the command line is near zero,
except for those two - since we put them there ourselves.
There are two ways to handle this issue:
1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.
The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discreet message saying that those options are,
from now on, ignored.
v2: mark set as const (Botond)
v3: rebase on top of master, indentation suggested by Duarte.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
This reverts commit 3b53f922a3. It's broken
in two ways:
1. concrete_allocating_function::allocate()'s caller,
region_group::start_releaser() loop, will delete the object
as soon as it returns; however we scheduled some work depending
on `this` in a separate continuation (via with_scheduling_group())
2. the calling loop's termination condition depends on the work being
done immediately, not later.
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node
I saw node2 could not bootstrap, stuck waiting for schema information to complete forever:
On node1, node3
[shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704
On node2
[shard 0] storage_service - JOINING: waiting for schema information to complete
This is because in the nodetool removenode operation, the generation of node2 was increased from 0 to 2.
gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
Each force_newer_generation_unsafe increases the generation by 1.
Here is an example,
Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
    "addrs": "127.0.0.2",
    "generation": 0,
    "is_alive": false,
    "update_time": 1521778757334,
    "version": 0
},
```
After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
    "addrs": "127.0.0.2",
    "application_state": [
        {
            "application_state": 0,
            "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
            "version": 214
        },
        {
            "application_state": 6,
            "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
            "version": 153
        }
    ],
    "generation": 2,
    "is_alive": false,
    "update_time": 1521779276246,
    "version": 0
},
```
In gossiper::apply_state_locally, we have this check:
```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}", ep, local_generation, remote_generation);
}
```
to skip the gossip update.
To fix, we relax the generation max difference check to allow the generation
of a removed node.
After this patch, the removed node bootstraps successfully.
Tests: dtest:update_cluster_layout_tests.py
Fixes#3331
Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
Just saw this today during a crash when creating Materialized Views.
It is still unclear why this happened. But the message says:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: scylla: sstables/sstables.cc:2973: sstables::sstable::remove_sstable_with_temp_toc(seastar::sstring, seastar::sstring, seastar::sstring, int64_t, sstables::sstable::version_types, sstables::sstable::format_types)::<lambda()>: Assertion `tmptoc == true' failed.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Aborting on shard 0.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Backtrace:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4b4c
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4df5
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4ea3
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libpthread.so.0+0x000000000000f0ff
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x00000000000355f6
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x0000000000036ce7
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e565
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e611
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000015969d0
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x0000000001596f7a
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x000000000051ca8d
I can't even guess which table caused the problem, let alone which SSTable.
That's because those asserts are the very first thing we do. We can discuss
whether or not assert is the right behaviour (usually we can't guarantee the
state is sane if that is missing, so I don't see a problem).
But it would be nice to see which SSTable we are processing before we assert.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180328160856.10717-1-glauber@scylladb.com>
"
The configuration API is part of scylla v2 configuration.
It uses the new definition capabilities of the API to dynamically create
the swagger definition for the configuration.
This means that the swagger will contain an entry with description and
type for each of the config values.
To get the v2 of the swagger file:
http://localhost:10000/v2
If using with swagger ui, change http://localhost:10000/api-doc to http://localhost:10000/v2
It takes longer to load because the file is much bigger now.
"
* 'amnon/config_api_v5' of github.com:scylladb/seastar-dev:
Explanation about the API V2
API: add the config API as part of the v2 API.
Defining the config api
The config API is created dynamically from the config. This means that
the swagger definition file will contain the description and types based on the
configuration.
The config.json file is used by the code generator to define a path that is
used to register the handler function.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"
This series introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.
The view_builder uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.
We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the view building process.
We employ a flat_mutation_reader for each base table for which we're
building views.
We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, when the view is added we restart the reader at the
beginning of the current token, but not at the beginning of the token
range. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.
We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
which would also apply backpressure, so we could, for example, delay
executing a build step.
Interaction with the system tables:
- When we start building a view, we add an entry to the
scylla_views_builds_in_progress system table. If the node restarts
at this point, we'll consider these newly inserted views as having
made no progress, and we'll treat them as new views;
- When we finish a build step, we update the progress of the views
that we built during this step by writing the next token to the
scylla_views_builds_in_progress table. If the node restarts here,
we'll start building the views at the token in the next_token
column.
- When we finish building a view, we mark it as completed in the
built views system table, and remove it from the in-progress system
table. Under failure, the following can happen:
* When we fail to mark the view as built, we'll redo the last
step upon node reboot;
* When we fail to delete the in-progress record, upon reboot
we'll remove this record.
A view is marked as completed only when all shards have finished
their share of the work, that is, if a view is not built, then all
shards will still have an entry in the in-progress system table;
- A view that a shard finished building, but not all other shards,
remains in the in-progress system table, with first_token ==
next_token.
Interaction with the distributed system tables:
- When we start building a view, we mark the view build as being
in-progress;
- When we finish building a view, we mark the view as being built.
Upon failure, we ensure that if the view is in the in-progress
system table, then it may not have been written to this table. We
don't load the built views from this table when starting. When
starting, the following happens:
* If the view is in the system.built_views table and not the
in-progress system table, then it will be in this one;
* If the view is in the system.built_views table and not in
this one, it will still be in the in-progress system table -
we detect this and mark it as built in this table too,
keeping the invariant;
* If the view is in this table but not in system.built_views,
then it will also be in the in-progress system table - we
don't detect this and will redo the missing step, for
simplicity.
View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.
When building view updates, we consider that everything is new and
nothing pre-existing is there (which means no tombstones will be sent
out to the paired view replicas).
Tests:
unit (debug)
dtest (materialized_view_test.py(smp=1, smp=2))
"
* 'view-building/v4' of https://github.com/duarten/scylla: (22 commits)
tests/view_build_test: Add tests for view building
tests/cql_test_env: Move eventually() to this file
tests/cql_assertions: Assert result set is not empty
tests/cql_test_env: Start the view_builder
db/view/view_builder: Allow synchronizing with the end of a build
db/view/view_builder: Actually build views
flat_mutation_reader: Make reader from mutation fragments
db/view/view_builder: React to schema changes
service/migration_listener: Add class for view notifications
db/view: Introduce view_builder
column_family: Add function to populate views
column_family: Allow synchronizing with in-progress writes
database: Compare view id instead of name in find_views()
database: Add get_views() function
db/view: Return a future when sending view updates
service/storage_service: Allow querying the view build status
db: Introduce system_distributed_keyspace
tests: Add unit test for build_progress_virtual_reader
db/system_keyspace: Add API for MV-related system tables
db/system_keyspace: Add virtual reader for MV in-progress build status
...
This is a separate file from view_schema_test because that one is
already becoming too long to run; also, having multiple test files
means they can be executed in parallel.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Intended for use by unit tests, this patch allows synchronizing with
the end of a build for a particular view.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the missing view building code to the eponymous class.
We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, when the view is added we restart the reader at the
beginning of the current token, but not at the beginning of the token
range. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.
We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
which would also apply backpressure, so we could, for example, delay
executing a build step.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Builds a reader from a set of ordered mutation fragments. This is
useful for building a reader out of a subset of segments returned by a
different reader. It is equivalent to building a mutation out of the
set of mutation fragments, and calling
make_flat_mutation_reader_from_mutations, except that it does not yet
support fast-forwarding.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The view_builder now uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.
We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the upcoming view building
code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Add a convenience base class for view notifications, which provides
a default implementation for all other types of notifications.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.
This patch introduces only the bootstrap functionality, which is
responsible for loading the data stored in the system tables and
filling the in-memory data structures with the relevant information,
to be used in subsequent patches for the actual view building. The
interaction with the system tables is as follows.
Interaction with the tables in system_keyspace:
- When we start building a view, we add an entry to the
scylla_views_builds_in_progress system table. If the node restarts
at this point, we'll consider these newly inserted views as having
made no progress, and we'll treat them as new views;
- When we finish a build step, we update the progress of the views
that we built during this step by writing the next token to the
scylla_views_builds_in_progress table. If the node restarts here,
we'll start building the views at the token in the next_token
column.
- When we finish building a view, we mark it as completed in the
built views system table, and remove it from the in-progress system
table. Under failure, the following can happen:
* When we fail to mark the view as built, we'll redo the last
step upon node reboot;
* When we fail to delete the in-progress record, upon reboot
we'll remove this record.
A view is marked as completed only when all shards have finished
their share of the work, that is, if a view is not built, then all
shards will still have an entry in the in-progress system table;
- A view that a shard finished building, but not all other shards,
remains in the in-progress system table, with first_token ==
next_token.
Interaction with the distributed system table (view_build_status):
- When we start building a view, we mark the view build as being
in-progress;
- When we finish building a view, we mark the view as being built.
Upon failure, we ensure that if the view is in the in-progress
system table, then it may not have been written to this table. We
don't load the built views from this table when starting. When
starting, the following happens:
* If the view is in the system.built_views table and not the
in-progress system table, then it will be in view_build_status;
* If the view is in the system.built_views table and not in
this one, it will still be in the in-progress system table -
we detect this and mark it as built in this table too,
keeping the invariant;
* If the view is in this table but not in system.built_views,
then it will also be in the in-progress system table - we
don't detect this and will redo the missing step, for
simplicity.
View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The populate_views() function takes a set of views to update, a
token to select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.
In particular, this will be used by the view building code, which needs
to wait until in-progress writes, which may have missed that there
is now a view, are observable to the view building code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
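A hedged sketch of one way to provide such synchronization, in the spirit of a phased barrier (Scylla has its own primitive; all names here are illustrative): writers hold an operation while they run, and await_pending_writes() resolves once every operation started before the call has finished:
```
#include <seastar/core/future.hh>
#include <memory>
#include <utility>

class write_phaser {
    struct phase {
        size_t count = 0;
        bool sealed = false;
        seastar::promise<> done;
    };
    std::shared_ptr<phase> _current = std::make_shared<phase>();
public:
    class operation {
        std::shared_ptr<phase> _p;
    public:
        explicit operation(std::shared_ptr<phase> p) : _p(std::move(p)) { ++_p->count; }
        operation(operation&&) = default;
        ~operation() {
            if (_p && --_p->count == 0 && _p->sealed) {
                _p->done.set_value(); // last straggler of a sealed phase
            }
        }
    };
    // Writers hold the returned object for the duration of the write.
    operation start_write() { return operation(_current); }
    // Seal the current phase and wait for its outstanding writes.
    seastar::future<> await_pending_writes() {
        auto prev = std::exchange(_current, std::make_shared<phase>());
        prev->sealed = true;
        if (prev->count == 0) {
            return seastar::make_ready_future<>();
        }
        return prev->done.get_future();
    }
};
```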
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.
A view can be absent from the result, successfully built, or
currently being built.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces a distributed system keyspace, used to hold
system tables that need to be replicated across a set of replicas
(that is, can't use the LocalStrategy).
In following patches, we will use this keyspace to hold a table
containing view building status updates for each node, used to support
range movements and a new nodetool command.
Fixes#3237
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements an API to access the MV-related system tables,
which pertain to the view building process.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Provide a virtual reader so users can query the in-progress view table
in a way compatible with Apache Cassandra.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When building a materialized view, we divide our work by shard, so we
need to register which shard did what work in the in-progress system
table. We also add the token we started at, which will enable some
optimizations in the view building code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Additional extension points.
* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
to inject extensions into hardcoded schemas.
Note: to make commitlog file extensions work, we need to ensure
we can be notified on segment delete, and thus need to fix the
old issue of the hard ::unlink call in the segment destructor.
Segment delete is therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).
Configurable listeners are now allowed to inject configuration
object into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.
All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"
* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
db::extensions: Allow extensions to modify (system) schemas
db::commitlog: Add commitlog/hints file io extension
db::commitlog: Do segment delete async + force replay delete go via CL
main/init: Change configurable callbacks and calls to allow adding opts
util::config_file: Add "add" config item overload
Allows extensions/config listeners to potentially augment
(system) schemas at boot time. This is only useful for schemas
that do not pass through system_schema tables.
Refs #2858
Push segment files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indiscriminately do deletes in non-anchored tasks, because we need
to guarantee that finished segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.
Also make sure we delete segments replayed via commitlog call,
so if we add metadata processing for CL, we can clear it out.
Since we just keep retrying, this can cause Scylla to not shut down for
a while.
The data will be safe in the commit log.
Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.
Ref #3318.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(), which, if it fails, will leave the
sstable in the compaction_backlog_tracker::_ongoing_writes map, which
will waste disk space, and that sstable will map to a dangling pointer
to a destroyed database_sstable_write_monitor, which causes a seg
fault when accessed (for example, through the backlog_controller,
which accounts the _ongoing_writes when calculating the backlog).
Fix this by increasing the scope of handle_exception().
Fixes#3315
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
* Don't dump output of failed tests immediately, print the output
for failed tests in the end instead.
* Fix exception printing in run_test(): don't assume passed in error
object is a `bytes` (or bytes-like) object, call the object's str
operator instead and let callers encode bytes objects instead.
* Don't assume Exception object has an `out` member, use operator str
instead to convert it to string.
* Don't print progress in run_test() directly because it results in
incomprehensible output as the executors race to print to stdout. Leave
progress report to the caller who can serialize progress prints.
* Automatically detect non-tty stdout and don't try to edit already
printed text.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7bb7e0003ded9b28710250bff851ea849bb99f7d.1522062795.git.bdenes@scylladb.com>
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.
"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.
"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.
Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.
Tests: unit (release), dtest (PENDING)
"
* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
Roles are implemented
auth: Increase delay before background tasks start
auth: Remove ordering dependence
auth: Don't warn on rescheduled task
auth: Wait for schema agreement
Single-node clusters can agree on schema
I've observed failures due to "missing" the peer nodes by about 1
second. Adding 5 seconds to the existing delay should cover most false
negative test results.
Fixes#3320.
If `auth::password_authenticator` also creates `system_auth.roles` and
we fix the existence check for the default superuser in
`auth::standard_role_manager` to only search for the columns that it
owns (instead of the column itself), then both modules' initialization
are independent of one another.
Fixes#3319.
Apache Cassandra also prints at the `info` level. This change prevents
tasks which we expect to be rescheduled from failing tests and scaring
users.
A good example of the importance of this change is when queries with a
quorum consistency level (for the default superuser) fail because a
quorum is not available. We will try again in this case, and this should
not cause integration tests to fail.
Some modules of `auth` create a default superuser if it does not already
exist.
The existence check is through a SELECT query with quorum consistency
level. If the schema for the applicable tables has not yet propagated to
a peer node at the time that it processes this query, then the
`storage_proxy` will print an error message to the log and the query
will be retried.
Eventually, the schema will propagate and the default superuser will be
created. However, the error message in the log causes integration tests
to fail (and is somewhat annoying).
Now, prior to querying for existing data, we wait for all gossip peers
to have the same schema version as we do.
Fixes#2852.
At some points while bootstrapping [1], new non-seed Scylla nodes wait
for schema agreement among all known endpoints in the cluster.
The check for schema agreement was in
`service::migration_manager::is_ready_for_bootstrap`. This function
would return `true` if, at the time of its invocation, the node was
aware of at least one `UP` peer (not itself) and that all `UP` peers had
the same schema version as the node.
We wish to re-use this check in the `auth` sub-system to ensure that
the schema for internal system tables used for access-control have
propagated to the entire cluster.
Unlike in `service/storage_service.cc`, where `is_ready_for_bootstrap`
was only invoked for seed nodes, we wish to wait for schema agreement
for all nodes regardless of whether or not they are seeds.
For a single-node cluster with itself as a seed,
`is_ready_for_bootstrap` would always return `false`.
We therefore change the conditions for schema agreement. Schema
agreement is now reached when there are no known peers (so the endpoint
map of the gossiper consists only of ourselves), or when there is at
least one `UP` peer and all `UP` peers have the same schema version as
us.
This change should not impact any bootstrap behavior in
`storage_service` because seed nodes do not invoke the function and
non-seed nodes wait for peer visibility before checking for schema
agreement.
Since this function is no longer checking for schema agreement only in
the context of bootstrapping non-seed nodes, we rename it to reflect its
generality.
[1] http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
I see the following error:
seastar/core/future-util.hh:597:10: note: constraints not satisfied
seastar/core/future-util.hh:597:10: note: with ‘sstables::sstable_version_types* c’
seastar/core/future-util.hh:597:10: note: with ‘sub_partitions_read::run_test_case()::<lambda(sstables::sstable::version_types)> aa’
seastar/core/future-util.hh:597:10: note: the required expression ‘seastar::futurize_apply(aa, (* c.begin()))’ would be ill-formed
seastar/core/future-util.hh:597:10: note: ‘seastar::futurize_apply(aa, (* c.begin()))’ is not implicitly convertible to ‘seastar::future<>’
The C array all_sstable_versions decayed to a pointer (see second gcc note)
and of course doesn't support std::begin().
Fix by replacing the C array with an std::array<>, which supports std::begin().
Not clear what made this break again, or why it worked before.
Message-Id: <20180325095239.12407-1-avi@scylladb.com>
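The decay problem distilled (a standalone example, not the test code):
```
#include <array>
#include <iterator>

int demo() {
    int c_array[] = {1, 2, 3};
    auto p = c_array;        // decays to int*: the extent is lost
    (void)p;
    // std::begin(p);        // ill-formed: pointers have no begin()
    std::array<int, 3> arr{1, 2, 3};
    return *std::begin(arr); // fine: std::array keeps its size
}
```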
"
This patchset removes unneeded object files from the test link,
reducing unnecessary links and reducing link time and executable
size.
Tests: build (release)
"
* tag 'test-link/v1' of https://github.com/avikivity/scylla:
build: link release.o into scylla and perf_fast_forward binaries only
build: don't link api/ into tests
release.o depends on the release date and git hash, and therefore changes
every time ./configure.py is executed. In turn, this causes all tests to
relink.
Improve the situation by only linking release.o into binaries that require
it.
This helps continuous integration scripts, which call configure.py
unconditionally. Developers usually won't, so they will not see significant
savings.
Tests: build (release)
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.
Fixes#3304.
"
* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test reads with no clustering ranges and no static columns
tests: simple_schema: Allow creating schema with no static column
clustering_ranges_walker: Stop after static row in case no clustering ranges
When there are no clustering ranges, stop at position which is right
after the static row instead of position which is after all clustered
rows.
This fixes an abort in sstable reader when querying a partition with
no clustering ranges (happens with counter tables) which also doesn't
have any static columns. In such case, the sstable_mutation_reader
will setup the data_consume_context such that it only covers the
static row of the partition, knowing that there is no need to read
any clustering row. See partition.cc::advance_to_upper_bound(). Later
when we're done with reading the static row (which is absent), we will
try to skip to the first clustering range, which in this case is
missing. If clustering_ranges_walker tells us to skip to
after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to an attempt to skip
past the original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine, because we end up
at the same data file position.
Fixes #3304.
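A hedged, simplified sketch of the fix's idea, using the position names from the message (the types are hypothetical; the real clustering_ranges_walker and position_in_partition differ):

    #include <cassert>

    // Hypothetical stand-in for position_in_partition bounds.
    enum class stop_position { before_all_clustering_rows, after_all_clustering_rows };

    stop_position position_after_static_row(bool has_clustering_ranges) {
        if (!has_clustering_ranges) {
            // Same data file position as the end of the static row, so
            // fast_forward_to() is never asked to skip past the original range.
            return stop_position::before_all_clustering_rows;
        }
        return stop_position::after_all_clustering_rows;
    }

    int main() {
        assert(position_after_static_row(false) == stop_position::before_all_clustering_rows);
    }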
"
There are a lot of things that we should be grouping in scheduling
groups that we aren't yet. The write path is not tagged at all,
and mutation_query isn't either. Some groups, like streaming's, are used - but not in
all places where they are needed.
Tests: unit (release)
"
* 'split-scheduling-groups-v2' of github.com:glommer/scylla:
database: group statements in their own scheduling group
database: apply streaming mutations with streaming priority
logalloc: capture current scheduling group for deferring function
Add a unit test for reproducing issue #2720 (and verifying its fix)
If a user tries to create a view whose primary key is missing any of the
base table's primary key columns, the creation should fail.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-3-nyh@scylladb.com>
Changed the order to check a couple of error conditions *after* checking
for too many or missing primary key columns. This order (showing the
too many or missing key columns first) is more useful, and is the order
in Cassandra's code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-2-nyh@scylladb.com>
A view's primary key must include all the columns of the base's primary
key. If we don't check this and fail the table's creation, we can discover
problems later on when using the table, as demonstrated in issue #2720.
We had such checking code (translated from the same code in Java) but it
had an extra "else" which caused nothing to be put in "missing_pk_columns"
so the error was never recognized.
Also, when the error does happen, we should print the column's name_as_text(),
not name() which is (surprisingly) just a number.
Fixes #2720.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-1-nyh@scylladb.com>
One of the tests created a base table with 5 primary key columns, but
put only 4 of them in the view. This is not allowed, but prior to fixing
issue #2720 this error was silently ignored. Let's fix the error instead
of relying on this silence.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180321094352.22329-1-nyh@scylladb.com>
When we introduced the CPU scheduler, we also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.
To accommodate that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are flushing the streaming memtables with streaming priority, but
applying the mutations themselves is still done with normal priorities.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we call run_when_memory_available, it is entirely possible that
the caller is doing that inside a scheduling_group. If we don't defer,
we execute in the correct group. But if we do defer, the current code will
execute - in the future - with the default scheduling group.
This patch fixes that by capturing the caller scheduling group and
making sure the function is executed later using it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
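A hedged sketch of the capture idiom described here (the deferral wrapper is hypothetical, not the actual logalloc code; seastar::current_scheduling_group() and seastar::with_scheduling_group() are real Seastar primitives, though their header placement varies by version):

    #include <utility>
    #include <seastar/core/scheduling.hh>
    #include <seastar/core/future-util.hh>  // with_scheduling_group(); location varies

    // Capture the caller's scheduling group now, re-enter it when the
    // deferred work finally runs, instead of inheriting the default group.
    template <typename Func>
    auto run_later_in_callers_group(Func func) {
        auto sg = seastar::current_scheduling_group();
        return [sg, func = std::move(func)] () mutable {
            return seastar::with_scheduling_group(sg, std::move(func));
        };
    }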
"
Since f8613a8415 we have reader-caching
on replicas for single-partition queries. This caching works best when
all pages of a query are sent to the same replicas consistently and thus
they can reuse the cached readers there.
The probability-based nature of read-repair works against this, as on any
given page a read-repair will be attempted or not based on probability.
This will cause high drop-rates on the replicas used for read-repair, as
the cached reader will not be reusable if the replica was skipped for
one or more pages.
To fix this make the repair-decision once, on the first page of the
query and store the decision in the paging-state. On all remaining
pages of the query use this stored decision.
Tests: unit-tests(release, debug), dtest(paging_advanced_tests.py)
Refs: #1865
"
* 'per_query_repair_decision/v2' of https://github.com/denesb/scylla:
Make the read-repair decision only once
storage_proxy: add coordinator_query_options and coordinator_query_result
Add query_read_repair_decision to paging-state
For several reasons that I cannot fit in the margin, when a view is
created, at most ONE regular column from the base table may be added
to the view's key.
This small new test verifies that if we try to add two columns, the
view creation fails.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235453.1613-1-nyh@scylladb.com>
We had a unit test, test_primary_key_is_not_null, for testing that
we correctly complain - or don't complain - on missing "IS NOT NULL"
restrictions, as expected.
However, this test missed the actual bug we had regarding IS NOT NULL
checking - see issue #2628 - because it mistook a silly syntax error,
which caused an exception, for the exception we expected to see :-)
So in this patch, I rewrote this test. It fixes the test's bug and
demonstrates issue #2628 (and verifies its fix), and also tests a few
more corner cases.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235000.1399-1-nyh@scylladb.com>
When creating a materialized view, the user must provide a "IS NOT NULL"
restriction for each of the created view's primary columns. If such a
restriction is missing, the view creation should fail. In #2628 we noticed
that sometimes it wasn't failing, but later updates to such a table would fail,
which is a bug.
There is actually one special case where "IS NOT NULL" is optional:
It is optional on the base's partition key column (when there is just
one of these) because it is already assumed that the partition key in
its entirety can never be null.
Our "IS NOT NULL" test, validate_primary_key(), had two logic errors
which caused it to miss some cases of missing "IS NOT NULL":
1. Instead of checking whether a certain column is the base's only
partition-key column, and avoid testing IS NOT NULL just for that
specific column, the code tested whether the schema *has* such a
column, and if it did, the test was skipped for all columns.
2. When the code found the one new column in the view's primary key, it
was so happy to find it that it immediately returned, and forgot to
test the IS NOT NULL on that column :-)
Both errors are fixed by this patch.
See the next patch for a unit test.
Fixes#2628.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319233657.522-1-nyh@scylladb.com>
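A hedged, simplified illustration of the two fixes (the types and helper are hypothetical, not Scylla's actual validate_primary_key()):

    #include <stdexcept>
    #include <string>
    #include <vector>

    struct column { std::string name; bool is_partition_key; bool has_is_not_null; };

    void validate_primary_key(const std::vector<column>& view_pk, size_t base_pk_size) {
        for (const column& c : view_pk) {
            // Fix 1: exempt only the base's single partition-key column itself,
            // not every column merely because such a column exists somewhere.
            if (base_pk_size == 1 && c.is_partition_key) {
                continue;
            }
            // Fix 2: no early return upon finding the newly-added key column;
            // its IS NOT NULL restriction must be checked too.
            if (!c.has_is_not_null) {
                throw std::runtime_error("Primary key column '" + c.name +
                                         "' is required to be filtered by 'IS NOT NULL'");
            }
        }
    }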
Right now we have yield points between partition processing guaranteed
by the fact that there are .get()s around the code, and those include
a yield point.
We have been discussing removing the implicit yield point from get and
pushing that to the caller. In that spirit, let's yield explicitly here
if needed.
It should be the loop's responsibility not to hurt
latency, either by being bounded to a small number of
iterations or by yielding. In other words, that loop should have a yield
point on every iteration (like the non-thread variant does).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180319173051.8918-1-glauber@scylladb.com>
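A hedged sketch of the explicit-yield idiom (the partition loop is a stand-in; seastar::need_preempt() and seastar::thread::yield() are real Seastar calls, assuming we run inside a seastar::thread):

    #include <seastar/core/thread.hh>
    #include <seastar/core/preempt.hh>

    void process_all_partitions(int n_partitions) {
        for (int p = 0; p < n_partitions; ++p) {
            // ... process one partition; no implicit yield points here ...
            if (seastar::need_preempt()) {
                seastar::thread::yield();  // yield only when the scheduler asks
            }
        }
    }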
"
These patches add support for C* 2.2 file(name) format.
Namely:
* It forces Scylla to write files in la format.
* Adds storage-service feature for them.
* cf and ks are determined from directory, not from file-name (for 2.2 format).
* Adds some other fixes to make dtest happy.
* Unit tests work with la format or with both formats.
"
* 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla:
tests/sstables: Tests use la format or iterate over both formats.
tests/sstables: Helper functions support 2.2 format directory structure.
sstables: Use 2.2 (la) format as the default format to store sstables if it is enabled by feature-bits.
storage_service: Support la sstable storage format as a feature.
sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
sstables: Throw a more detailed exception for an unknown item in reverse_map.
sstables/compaction: Suppress NaN in the throughput report.
Make the read-repair decision on the first page of a paged-query and use
it for all the remaining pages. This helps querier-cache hit-rates as
reads will be sent to the same nodes consistently throughout the query.
"
Fixes to gossip pertaining to the shadow round.
In particular, an issue preventing a node from being marked as alive is
fixed: After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them marks
them as dead.
If the shadow round failed, after doing the feature checking against the
system tables, we were not clearing the state map and re-adding the
endpoints. This left the alive marker set, and prevented
real_mark_alive() from eventually being called.
Fixes #3301
"
* 'gossip/shadow-round-fixes/v3' of https://github.com/duarten/scylla:
gms/gossiper: Remove superfluous check
service/storage_service: Always re-add loaded endpoints
gms/gossiper: Check for shadow round completion before throwing
As yet more parameters and return-values are about to be added to all
storage_proxy::query_* methods we need a way that scales better than
changing the signatures every time. To this end we aggregate all
non-mandatory query parameters into `coordinator_query_options` and all
return values into `coordinator_query_result`.
This way new fields can be simply added to the respective structs while
the signatures of the methods themselves and their client code can
remain unchanged.
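A hedged, self-contained sketch of the aggregation pattern (the field names are illustrative, not Scylla's actual members):

    #include <optional>
    #include <string>

    struct coordinator_query_options {
        std::optional<bool> read_repair_decision;  // absent unless already decided
        // new optional parameters go here, without touching any signature
    };

    struct coordinator_query_result {
        std::string rows;  // stand-in for the real query result type
        // new return values go here, likewise
    };

    coordinator_query_result query(const std::string& request,
                                   coordinator_query_options opts = {}) {
        (void)opts;  // consulted by the real implementation
        return {request};
    }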
This new field will store the repair-decision made on the first page of
the query. This decision will be sticky to all pages of the query.
In mixed clusters the decision might not happen on the first page and it
might even change during the query, as old coordinators will neither store
nor respect the decision.
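A hedged sketch of the sticky decision (all names hypothetical): decide on the first page, then reuse the decision carried in the paging state for every later page.

    #include <optional>
    #include <random>

    struct paging_state {
        std::optional<bool> read_repair_decision;  // empty on the first page
    };

    bool should_read_repair(paging_state& ps, double chance) {
        if (!ps.read_repair_decision) {
            // First page: make the probability-based decision once.
            static std::mt19937 rng{std::random_device{}()};
            ps.read_repair_decision = std::bernoulli_distribution(chance)(rng);
        }
        return *ps.read_repair_decision;  // sticky for the rest of the query
    }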
After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us - before
re-adding them. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them
marks them as dead.
If the shadow round failed, after doing the feature checking against
the system tables, we were not clearing the state map and re-adding
the endpoints. This left the alive marker set, and prevented
real_mark_alive() from eventually being called.
Fixes #3301
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patchset reduces the time required to run the tests, mostly by
running them in parallel.
I measured a reduction of 3.5X on a 1s4c4t desktop (release mode).
Tests: unit (release)
* tag 'faster-tests/v2' of https://github.com/avikivity/scylla:
tests: run tests in parallel
tests: simplify timeout handling
tests: don't require crash integrity
tests: allow sharing the machine with other tests
tests: extract seastar options to a separate variable
tests: reduce memory for tests
tests: add "--" unconditionally for boost tests
tests: start cql_test_env without binding to messaging port
storage_service: allow starting gossiper without binding to messaging port
gms: allow gossiper to start_gossiping() without binding to the port
tests: close file correctly in loading_file_test
By using the overprovisioned flag, we reduce polling and pinning, so
less CPU time is wasted and the scheduler has more options to schedule
reactor threads.
Now that we have a minimum boost version, we don't need to check whether
boost requires "--" before test-specific command line arguments. Removing
the check speeds up the test a little.
This is useful in tests, which don't communicate. Binding to a port can
fail if the system is running something else.
It would be better to prevent even more of the gossiper from starting up,
but that is more difficult.
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.
This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).
Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.
Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one was
iterating over.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
Summary has a function, memory_size(), that estimates the amount of
memory the summary takes. It is my understanding that this is called
to serve information to tooling.
First, this function is inaccurate because it doesn't take into account
the tokens of each entry, just the keys. But more importantly, it has
to iterate over all keys, which can be pretty expensive if the entries
list is long. We now keep the key data in memory areas, with just
pointers in each entry. So instead of iterating through the entries, we
can iterate through the memory areas, which is much cheaper.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180316120915.16809-1-glauber@scylladb.com>
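A hedged, simplified sketch of the cheaper estimate (the types are hypothetical, not the actual summary): sum the few large backing chunks instead of walking every entry's key.

    #include <cstddef>
    #include <vector>

    struct summary {
        std::vector<std::vector<char>> key_chunks;  // stable storage for key bytes
        size_t n_entries = 0;

        size_t memory_size() const {
            size_t s = n_entries * sizeof(void*);   // per-entry pointer overhead
            for (const auto& chunk : key_chunks) {
                s += chunk.capacity();              // a handful of chunks, not n_entries keys
            }
            return s;
        }
    };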
When the cluster is changed (nodes added or removed), ranges of tokens
are moved between nodes. Scylla initiates a streaming process between an
old and a new owner of the range, which can take a long time. During
that streaming time, the new owner of the range is known as a "pending node"
for this range, and all updates must go to both the old owner (in case the
movement fails!) and the pending node (in case the movement succeeds).
For materialized views, because they are ordinary tables, streaming moves
all the view's data that existed before the streaming started. But we did
not send updates done to the view *during* the streaming. A dtest
demonstrates that the new node will miss some of the view updates, and will
require a repair of the view tables immediately after the cluster change
ends, which is not good. To fix that, we need to send every new update
that happens during the streaming also to the "pending node". We already
did this properly for base-table updates, but not to the view updates:
Each base table replica wrote to only one paired view table replica,
and nobody wrote to the new pending node (in the case where there is one,
for the particular view token involved).
In this patch, we make sure that all view updates go also to the "pending
nodes" when there are any. We do the same thing that Cassandra does, which
is - *all* base replicas write the update to the pending node(s).
Arguably, it is inefficient that all replicas send the update to the same
node. In most cases it is enough to send it from just one base replica -
the one who is slated to be the new node's pair. I opened
https://issues.apache.org/jira/browse/CASSANDRA-14262 about this idea.
But that is an optimization. The patch as-is already fixes the bug.
Fixes #3211
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180313171853.17283-1-nyh@scylladb.com>
The functional change in this series is in the last patch
("auth: Grant all permissions to object creator").
The first patch addresses `const` correctness in `auth`. This change
allowed the new code added in the last patch to be written with the
correct `const` specifiers, and also some code to be removed.
The second-to-last patch addresses error-handling in the authorizer for
unsupported operations and is a prerequisite for the last patch (since
we now always grant permissions for new database objects).
Tests: unit (release)
* 'jhk/default_permissions/v3' of https://github.com/hakuch/scylla:
auth: Grant all permissions to object creator
auth: Unify handling for unsupported errors
auth: Fix life-time issue with parameter
auth: Fix `const` correctness
"
This is an improvement on my latest series. Instead of just
dealing with the problem of destroying the Summary that I
identified in a previous test, I have tried to find other sources
of stalls.
Some of them are on readers and would affect early processes and
operations like nodetool refresh.
Others are on writers, which can affect any SSTable being written.
Two of those stalls (on large filter, on summary read) I saw in a
synthetic benchmark where I used very small values + nodetool compact
to generate one SSTable with many keys. They were 80ms and 20ms
respectively, and now they are totally gone.
For others, I just tried to be safe (for instance, if we know
reading/writing large vectors can be costly, just always insert
preemption points in them).
With all of these patches applied, I no longer see stalls coming from
the SSTable code in those tests (although given enough time, I am sure I
can find more).
Tests: unit (release)
Fixes: #3282, Fixes #3281, Fixes #3269
"
* 'sstables-stalls-v3-updated' of github.com:glommer/scylla:
large_bitset/bloom filter: add preemption points in loops
sstables: read filter in a thread
abstract summary entry version of the token with a token view
add a token_view
sstables: rework summary entries reading
sstables: avoid calls to resize for vectors
sstables: replace potentially large for loop with do_until
summary_entry: do not store key bytes in each summary entry
tests: change tests to make summary non-copyable
chunked_vector: do not iterate to destruct trivially destructible types
SSTables that contain many keys - a common case with small partitions in
long-lived nodes - can generate filters that are quite large.
I have seen stalls over 80ms when reading a filter that was the result
of a 6h write load of very small keys after nodetool compact (filter was
in the 100s of MB).
Similar care should be taken when creating the filter, as if the
estimated number of partitions is big, the resulting large_bitset can be
quite big as well.
If we treat the i_filter.hh and large_bitset.hh interfaces as truly
generic, then maybe we should have an in_thread version along with a
common version. But the bloom filter is the only user for both and even
if that changes in the future, it is still a good idea to run something
with a massive loop in a thread.
So for simplicity, I am just asserting that we are on a thread to avoid
surprises, and inserting preemption points in the loops.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
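A hedged sketch of the guard described here (a hypothetical bitset clear, not the actual large_bitset code): fail fast when not on a seastar::thread, and keep the big loop preemptible.

    #include <cassert>
    #include <cstdint>
    #include <vector>
    #include <seastar/core/thread.hh>
    #include <seastar/core/preempt.hh>

    void clear_in_thread(std::vector<uint64_t>& words) {
        assert(seastar::thread::running_in_thread());  // avoid surprises
        for (auto& w : words) {
            w = 0;
            if (seastar::need_preempt()) {
                seastar::thread::yield();              // preemption point per chunk of work
            }
        }
    }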
Constructing filter objects can be quite expensive. We will insert some
yield points around, and that is made a lot easier if we are calling
things from a thread.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
dht::token doesn't have a trivial destructor, so destroying an array
full of those can be quite expensive. If we use the same trick as we
used for the summary - storing the token data in a stable memory
location - we can leave the entries with a trivial destructor and destroy
the chunks themselves. Those being larger, they will be more efficient
to delete.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Ideally we would like tokens to be trivially destructible, so that we
can easily dispose of giant vectors holding them. While that is hard to
do with our current infrastructure, we can introduce a token_view, which
holds a bytes_view instead of the real data - making it
trivially destructible.
The comparators are then changed to take a token_view, and an implicit
conversion function is provided from tokens so they get compared.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
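A hedged, simplified illustration of the token_view idea (std::string/std::string_view stand in for bytes/bytes_view; the real dht::token differs): a view type referencing stable storage is trivially destructible, so huge vectors of it can be dropped without visiting every element.

    #include <string>
    #include <string_view>
    #include <type_traits>

    struct token {
        std::string data;                               // owning, non-trivial destructor
    };

    struct token_view {
        std::string_view data;                          // non-owning view
        token_view(const token& t) : data(t.data) {}    // implicit conversion from token
    };

    static_assert(std::is_trivially_destructible_v<token_view>);

    inline bool operator<(token_view a, token_view b) { // comparators take views
        return a.data < b.data;
    }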
Currently Ubuntu 18.04 uses distribution provided g++ and boost, but it's easier
to maintain Scylla package to build with same version toolchain/libraries, so
switch to them.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
dist/debian: build only scylla, iotune
dist/debian: switch to boost-1.65
dist/debian: switch to gcc-7.3
Like we did for generic arrays, let's move away from resize() in trying
to read summary entries and move to a reserve/push pattern.
I have tested this patch reading a summary file that is a couple of MB big.
Stalls up to 20ms were seen. After applying this patch, no such stalls
are present.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
resize is considered harmful, since it will attempt to allocate memory
and initialize each element of the vector. This can cause reactor stalls
that correlate with latency peaks.
A better idiom is reserve first - so we know we will have enough memory
and won't have to move contents - and push_back/emplace_back each
individual member.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
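A hedged illustration of the idiom (read_entry is a hypothetical stand-in for the real parser):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    uint64_t read_entry(size_t i) { return i; }  // stand-in for the real parser

    std::vector<uint64_t> read_entries(size_t count) {
        std::vector<uint64_t> entries;
        entries.reserve(count);                  // one allocation up front, no moves later
        for (size_t i = 0; i < count; ++i) {
            entries.push_back(read_entry(i));    // initialize each element exactly once
        }
        return entries;
        // By contrast, entries.resize(count) would value-initialize all `count`
        // elements in one uninterruptible pass before any real data is written.
    }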
We are pushing ints here, so it shouldn't be that bad in practice.
But a potentially gigantic for loop is just asking for a stall since we won't
need_preempt() it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If we store a bytes_view, which has a trivial destructor, instead of bytes,
then we don't need to destroy each element individually. To do that,
we allocate the data in a couple of large arrays which can be disposed of
easily and point to it.
We still can't destroy trivially because of the token.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the summary can be copied, but in real life there is no reason
for this to be a requirement. Tests want it so that we can destroy a summary,
load another, and compare the two. We can achieve the same by allowing the first
summary to be moved, and then we can still have a reference to the second.
I am about to make a change that will make the summary not copyable as a
requirement, so we need to do this first.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We get the following compile error on Debian/Ubuntu with boost-1.63:
/opt/scylladb/include/boost/intrusive/pointer_plus_bits.hpp:76:48: error: '*((void*)& __tmp +136)' is used uninitialized in this function [-Werror=uninitialized]
n = pointer(uintptr_t(p) | (uintptr_t(n) & Mask));
~~~~~~~~~~~~~~^~~~~~~
This is a known issue (https://github.com/boostorg/intrusive/issues/29), fixed
in boost-1.65.
Switch to boost-1.65 to fix the issue.
Fixes #3090
* seastar bcfbe0c...a66cc34 (3):
> reactor: fix sleep mode
> cpu scheduler: don't penalize first group to run
> Simple shellscript to find out which logical CPU's shards are mapped to
When a table, keyspace, or role is created, the creator is now
automatically granted all applicable permissions on the object.
This behavior is consistent with Apache Cassandra.
Fixes #3216.
Instead of some functions in `allow_all_authorizer` throwing exceptions
and others silently passing through, we consistently return exception
futures with `auth::unsupported_authorization_operation`. These errors
are converted to `invalid_request_exception` at the CQL layer and
ignored where appropriate in the auth subsystem.
This patch came about because of an important (and obvious, in
hindsight) realization: instances of the authorizer, role manager, and
authenticator are clients for access-control state and not the state
itself. This is reflected directly in Scylla: `auth::service` is
sharded across cores and this is possible because each instance queries
and modifies the same global state.
To give more examples, the value of an instance of `std::vector<int>` is
the structure of the container and its contents. The value of `int
file_descriptor` is an identifier for state maintained elsewhere.
Having watched an excellent talk by Herb Sutter [1] and having read an
informative blog post [2], it's clear that a member function marked
`const` communicates that the observable state of the instance is not
modified.
Thus, the member functions of the role-manager, authenticator, and
authorizer clients should be left non-`const` only if the state of the
client itself is observably changed.
which do not change the state of the client, but which mutate the global
state the client is associated with (for example, by creating a role)
are marked `const`.
The `start` (and `stop`) functions of the client have the dual role of
initializing (finalizing) both the local client state and the
external state; they are not marked `const`.
[1] https://herbsutter.com/2013/01/01/video-you-dont-know-const-and-mutable/
[2] http://talesofcpp.fusionfenix.com/post-2/episode-one-to-be-or-not-to-be-const