The list of all endpoints for a query can be empty if we have
replication_factor 0 or there are no live endpoints for this token.
Do not access all_replicas.front() in this case.
Fixes#5935.
Message-Id: <20200306192521.73486-2-kostja@scylladb.com>
(cherry picked from commit 9827efe554)
The function was reimplemented to solve the following issues.
The cutom implementation also improved its performance in
close to 19%
Using regex_match("[a-z][a-z0-9_]*") may cause stack overflow on long input strings
as found with the limits_test.py:TestLimits.max_key_length_test dtest.
std::regex_replace does not replace in-place so no doubling of
quotes was actually done.
Add unit test that reproduces the crash without this fix
and tests various string patterns for correctness.
Note that defining the regex with std::regex::optimize
still ended up with stack overflow.
Fixes#5671
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0329fe1fd1)
Due to a bug the entire segment is written in one huge write of 32Mb.
The idea was to split it to writes of 128K, so fix it.
Fixes#5857
Message-Id: <20200220102939.30769-1-gleb@scylladb.com>
(cherry picked from commit df2f67626b)
The get token range API can become big which can cause large allocation
and stalls.
This patch replace the implementation so it would stream the results
using the http stream capabilities instead of serialization and sending
one big buffer.
Fixes#6297
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 7c4562d532)
In bash, 'A || B && C' will be problem because when A is true, then it will be
evaluates C, since && and || have the same precedence.
To avoid the issue we need make B && C in one statement.
Fixes#5764
(cherry picked from commit b6988112b4)
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).
There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:
1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.
2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.
Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:
auto prev = std::prev(it);
if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
it->set_continuous(true);
but handle_end_of_stream() is not invoked under allocating section.
Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:
return with_linearized_managed_bytes([&] {
...
return _worker_state->alloc_section(region, [&] {
Fixes#6637.
Refs #6108.
Tests:
- unit (all)
Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e81fc1f095)
Fixes#5808
Seems some gcc:s will generate the code as sign extending. Mine does not,
but this should be more correct anyhow.
Added small stringify test to serialization_test for inet_address
(cherry picked from commit a14a28cdf4)
Consider
1. Start n1, n2 in the cluster
2. Stop n2 and delete all data for n2
3. Start n2 to replace itself with replace_address_first_boot: n2
4. Kill n2 before n2 finishes the replace operation
5. Remove replace_address_first_boot: n2 from scylla.yaml of n2
6. Delete all data for n2
7. Start n2
At step 7, n2 will be allowed to bootstrap as a new node, because the
application state of n2 in the cluster is HIBERNATE which is not
rejected in the check of is_safe_for_bootstrap. As a result, n2 will
replace n2 with a different tokens and a different host_id, as if the
old n2 node was removed from the cluster silently.
Fixes#5172
(cherry picked from commit cdcedf5eb9)
Scylla returns the wrong error code (0000 - server internal error)
in response to trying to do authentication/authorization operations
that involves a non-existing role.
This commit changes those cases to return error code 2200 (invalid
query) which is the correct one and also the one that Cassandra
returns.
Tests:
Unit tests (Dev)
All auth and auth_role dtests
(cherry picked from commit ce8cebe34801f0ef0e327a32f37442b513ffc214)
Fixes#6363.
The implementation of get_range_to_address_map has a default behaviour,
when getting an empty keypsace, it uses the first non-system keyspace
(first here is basically, just a keyspace).
The current implementation has two issues, first, it uses a reference to
a string that is held on a stack of another function. In other word,
there's a use after free that is not clear why we never hit.
The second, it calls get_non_system_keyspaces twice. Though this is not
a bug, it's redundant (get_non_system_keyspaces uses a loop, so calling
that function does have a cost).
This patch solves both issues, by chaning the implementation to hold a
string instead of a reference to a string.
Second, it stores the results from get_non_system_keyspaces and reuse
them it's more efficient and holds the returned values on the local
stack.
Fixes#6465
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 69a46d4179)
When index file is larger than 4GB, offset calculation will overflow
uint32_t and _promoted_index_end will be too small.
As a result, promoted_index_size calculation will underflow and the
rest of the page will be interpretd as a promoted index.
The partitions which are in the remainder of the index page will not
be found by single-partition queries.
Data is not lost.
Introduced in 6c5f8e0eda.
Fixes#6040
Message-Id: <20200521174822.8350-1-tgrabiec@scylladb.com>
(cherry picked from commit a6c87a7b9e)
In a recent next failure I got the following backtrace
function=function@entry=0x270360 "seastar::rpc::sink_impl<Serializer, Out>::~sink_impl() [with Serializer = netw::serializer; Out = {repair_row_on_wire_with_cmd}]") at assert.c:101
at ./seastar/include/seastar/core/shared_ptr.hh:463
at repair/row_level.cc:2059
This patch changes a few functions to use finally to make sure the sink
is always closed.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200515202803.60020-1-espindola@scylladb.com>
(cherry picked from commit 311fbe2f0a)
Ref #6414
Consider: n1, n2, n1 is the repair master, n2 is the repair follower.
=== Case 1 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
is written to sstable, r2 is not written yet, r1 belongs to
partition 1, r2 belongs to partition 2. It yields after row r1 is
written.
data: partition_start, r1
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream()
data: partition_start, r1, partition_end
5) Step 2 resumes to apply the rows.
data: partition_start, r1, partition_end, partition_end, partition_start, r2
=== Case 2 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
is written to sstable, r2 is not written yet, r1 belongs to partition
1, r2 belongs to partition 2. It yields after partition_start for r2
is written but before _partition_opened is set to true.
data: partition_start, r1, partition_end, partition_start
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream().
Since _partition_opened[node_idx] is false, partition_end is skipped,
end_of_stream is written.
data: partition_start, r1, partition_end, partition_start, end_of_stream
This causes unbalanced partition_start and partition_end in the stream
written to sstables.
To fix, serialize the write_end_of_stream and apply_rows with a semaphore.
Fixes: #6394Fixes: #6296Fixes: #6414
(cherry picked from commit b2c4d9fdbc)
When sending hints from one file, rps_set field in send_one_file_ctx
keeps track of commitlog positions of hints that are being currently
sent, or have failed to be sent. At the end of the operation, if sending
of some hints failed, we will choose position of the earliest hint that
failed to be sent, and will retry sending that file later, starting from
that position. This position is stored in _last_not_complete_rp.
Usually, this set has a bounded size, because we impose a limit of at
most 128 hints being sent concurrently. Because we do not attempt to
send any more hints after a failure is detected, rps_set should not have
more than 128 elements at a time.
Due to a bug, commitlog positions of old hints (older than
gc_grace_seconds of the destination table) were inserted into rps_set
but not removed after checking their age. This could cause rps_set to
grow very large when replaying a file with old hints.
Moreover, if the file mixed expired and non-expired hints (which could
happen if it had hints to two tables with different gc_grace_seconds),
and sending of some non-expired hints failed, then positions of expired
hints could influence calculation _last_not_complete_rp, and more hints
than necessary would be resent on the next retry.
This simple patch removes commitlog position of a hint from rps_set when
it is detected to be too old.
Fixes#6422
(cherry picked from commit 85d5c3d5ee)
Related commit: 85d5c3d
When attempting to send a hint, an exception might occur that results in
that hint being discarded (e.g. keyspace or table of the hint was
removed).
When such an exception is thrown, position of the hint will already be
stored in rps_set. We are only allowed to retain positions of hints that
failed to be sent and needed to be retried later. Dropping a hint is not
an error, therefore its position should be removed from rps_set - but
current logic does not do that.
Because of that bug, hint files with many discardable hints might cause
rps_set to grow large when the file is replayed. Furthermore, leaving
positions of such hints in rps_set might cause more hints than necessary
to be re-sent if some non-discarded hints fail to be sent.
This commit fixes the problem by removing positions of discarded hints
from rps_set.
Fixes#6433
(cherry picked from commit 0c5ac0da98)
Speculative reader has more targets that needed for CL. In case there is
a digest mismatch the repair runs between all of them, but that violates
provided CL. The patch makes it so that repair runs only between
replicas that answered (there will be CL of them).
Fixes#6123
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200402132245.GA21956@scylladb.com>
(cherry picked from commit 36a24bbb70)
On Centos 7 machine:
fstrim.timer not enabled, only unmasked due scylla_fstrim_setup on installation
When trying run scylla-fstrim service manually you get error:
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 60, in <module>
main()
File "/opt/scylladb/scripts/libexec/scylla_fstrim", line 44, in main
cfg = parse_scylla_dirs_with_default(conf=args.config)
File "/opt/scylladb/scripts/scylla_util.py", line 484, in parse_scylla_dirs_with_default
if key not in y or not y[k]:
NameError: name 'k' is not defined
It caused by error in scylla_util.py
Fixes#6294.
(cherry picked from commit 068bb3a5bf)
Until now, attempts to print counter update cell would end up
calling abort() because `atomic_cell_view::value()` has no
specialized visitor for `imr::pod<int64_t>::basic_view<is_mutable>`,
i.e. counter update IMR type. Such visitor is not easy to write
if we want to intercept counters only (and not all int64_t values).
Anyway, linearized byte representation of counter cell would not
be helpful without knowing if it consists of counter shards or
counter update (delta) - and this must be known upon `deserialize`.
This commit introduces simple approach: it determines cell type on
high level (from `atomic_cell_view`) and prints counter contents by
`counter_cell_view` or `atomic_cell_view::counter_update_value()`.
Fixes#5616
(cherry picked from commit 0ea17216fe)
The intention of the code was to clear sharding metadata
chunked_vector so that it doesn't bloat memory.
The type of c is `chunked_vector*`. Assigning `{}`
clears the pointer while the intended behavior was to reset the
`chunked_vector` instance. The original instance is left unmodified
with all its reserved space.
Because of this, the previous fix had no effect because token ranges
are stored entirely inline and popping them doesn't realease memory.
Fixes#4951
Tests:
- sstable_mutation_test (dev)
- manual using scylla binary on customer data on top of 2019.1.5
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1584559892-27653-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 5fe626a887)
Row cache needs to be invalidated whenever data in sstables
changes. Cleanup removes data from sstables which doesn't belong to
the node anymore, which means cache must be invalidated on cleanup.
Currently, stale data can be returned when a node re-owns ranges which
data are still stored in the node's row cache, because cleanup didn't
invalidate the cache."
Fixes#4446.
tests:
- unit tests (dev mode)
- dtests:
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test
cleanup_test.py
(cherry picked from commit d0b6be0820)
In order to prevent users from creating a network topology
strategy instance with invalid inputs, it's not enough to use
std::stol() on the input: a string "3abc" still returns the number '3',
but will later confuse cqlsh and other drivers, when they ask for
topology strategy details.
The error message is now more human readable, since for incorrect
numeric inputs it used to return a rather cryptic message:
ServerError: stol()
This commit fixes the issue and comes with a simple test.
Fixes#3801
Tests: unit(dev)
Message-Id: <7aaae83d003738f047d28727430ca0a5cec6b9c6.1583478000.git.sarna@scylladb.com>
(cherry picked from commit 5b7a35e02b)
The "jobs" script is used to determine the amount of compilation
parallelism on a machine. It attempts to ensure each GCC process has at
least 4 GB of memory per core. However, in the worst case scenario, we
could end up having the GCC processes take up all the system memory,
forcin swapping or OOM killer to kick in. For example, on a 4 core
machine with 16 GB of memory, this worst case scenario seems easy to
trigger in practice.
Fix up the problem by keeping a 1 GB of memory reserve for other
processes and calculating parallelism based on that.
Message-Id: <20200423082753.31162-1-penberg@scylladb.com>
(cherry picked from commit 7304a795e5)
Generating timestamps for auth signatures used a non-thread-safe
::gmtime function instead of thread-safe ::gmtime_r.
Tests: unit(dev)
Fixes#6345
(cherry picked from commit fb7fa7f442)
"
This series fix hang in multishard_writer when error happens. It contains
- multishard_writer: Abort the queue attached to consumers when producer fails
- repair: Fix hang when the writer is dead
Fixes#6241
Refs: #6248
"
* asias-stream_fix_multishard_writer_hang:
repair: Fix hang when the writer is dead
mutation_writer_test: Add test_multishard_writer_producer_aborts
multishard_writer: Abort the queue attached to consumers when producer fails
(cherry picked from commit 8925e00e96)
If no keyspace is specified when taking snapshot, there will be a segfault
because keynames is unconditionally dereferenced. Let's return an error
because a keyspace must be specified when column families are specified.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427195634.99940-1-raphaelsc@scylladb.com>
(cherry picked from commit 02e046608f)
Fixes#6336.
In commit 388b492040, which was only supposed
to move around code, we accidentally lost the line which does
_executor.local()._stats.total_operations++;
So after this commit this counter was always zero...
This patch returns the line incrementing this counter.
Arguably, this counter is not very important - a user can also calculate
this number by summing up all the counters in the scylla_alternator_operation
array (these are counters for individual types of operations). Nevertheless,
as long as we do export a "scylla_alternator_total_operations" metric,
we need to correctly calculate it and can't leave it zero :-)
Fixes#5836
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200219162820.14205-1-nyh@scylladb.com>
(cherry picked from commit b8aed18a24)
When multiple key columns (clustering or partition) are passed to
the schema constructor, all having the same column id, the expectation
is that these columns will retain the order in which they were passed to
`schema_builder::with_column()`. Currently however this is not
guaranteed as the schema constructor sort key columns by column id with
`std::sort()`, which doesn't guarantee that equally comparing elements
retain their order. This can be an issue for indexes, the schemas of
which are built independently on each node. If there is any room for
variance between for the key column order, this can result in different
nodes having incompatible schemas for the same index.
The fix is to use `std::stable_sort()` which guarantees that the order
of equally comparing elements won't change.
This is a suspected cause of #5856, although we don't have hard proof.
Fixes: #5856
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
[avi: upgraded "Refs" to "Fixes", since we saw that std::sort() becomes
unstable at 17 elements, and the failing schema had a
clustering key with 23 elements]
Message-Id: <20200417121848.1456817-1-bdenes@scylladb.com>
(cherry picked from commit a4aa753f0f)
While CQL does not allow creation of a materialized view with more than one
base regular column in the view's key, in Alternator we do allow this - both
partition and clustering key may be a base regular column. We had a bug in
the logic handling this case:
If the new base row is missing a value for *one* of the view key columns,
we shouldn't create a view row. Similarly, if the existing base row was
missing a value for *one* of the view key columns, a view row does not
exist and doesn't need to be deleted. This was done incorrectly, and made
decisions based on just one of the key columns, and the logic is now
fixed (and I think, simplified) in this patch.
With this patch, the Alternator test which previously failed because of
this problem now passes. The patch also includes new tests in the existing
C++ unit test test_view_with_two_regular_base_columns_in_key. This tests
was already supposed to be testing various cases of two-new-key-columns
updates, but missed the cases explained above. These new tests failed
badly before this patch - some of them had clean write errors, others
caused crashes. With this patch, they pass.
Fixes#6008.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200312162503.8944-1-nyh@scylladb.com>
(cherry picked from commit 635e6d887c)
Some legacy `mc` SSTables (created in Scylla 3.0) may contain incorrect
serialization headers, which don't wrap frozen UDTs nested inside collections
with the FrozenType<...> tag. When reading such SSTable,
Scylla would detect a mismatch between the schema saved in schema
tables (which correctly wraps UDTs in the FrozenType<...> tag) and the schema
from the serialization header (which doesn't have these tags).
SSTables created in Scylla versions 3.1 and above, in particular in
Scylla versions that contain this commit, create correct serialization
headers (which wrap UDTs in the FrozenType<...> tag).
This commit does two things:
1. for all SSTables created after this commit, include a new feature
flag, CorrectUDTsInCollections, presence of which implies that frozen
UDTs inside collections have the FrozenType<...> tag.
2. when reading a Scylla SSTable without the feature flag, we assume that UDTs
nested inside collections are always frozen, even if they don't have
the tag. This assumption is safe to be made, because at the time of
this commit, Scylla does not allow non-frozen (multi-cell) types inside
collections or UDTs, and because of point 1 above.
There is one edge case not covered: if we don't know whether the SSTable
comes from Scylla or from C*. In that case we won't make the assumption
described in 2. Therefore, if we get a mismatch between schema and
serialization headers of a table which we couldn't confirm to come from
Scylla, we will still reject the table. If any user encounters such an
issue (unlikely), we will have to use another solution, e.g. using a
separate tool to rewrite the SSTable.
Fixes#6130.
(cherry picked from commit 3d811e2f95)
As @tgrabiec helpfully pointed out, creating a row tombstone
for a table which does not have a clustering key in its schema
creates something that looks like an open-ended range tombstone.
That's problematic for KA/LA sstable formats, which are incapable
of writing such tombstones, so a workaround is provided
in order to allow using KA/LA in alternator.
Fixes#6035
Cherry-picked from 0a2d7addc0
Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.
n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)
One year later, user wants the upgrade n1,n2,n3 to a new version
when n3 does a rolling restart with a new version, n3 will use a
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, so g1 and g2 will reject n3's
gossip update and mark g3 as down.
Such unnecessary marking of node down can cause availability issues.
For example:
DC1: n1, n2
DC2: n3, n4
When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.
To fix, we can start the node with a gossip generation within
MAX_GENERATION_DIFFERENCE difference for the new node.
Once all the nodes run the version with commit
0a52ecb6df, the option is no logger
needed.
Fixes#5164
(cherry picked from commit 743b529c2b)
User reported an issue that after a node restart, the restarted node
is marked as DOWN by other nodes in the cluster while the node is up
and running normally.
Consier the following:
- n1, n2, n3 in the cluster
- n3 shutdown itself
- n3 send shutdown verb to n1 and n2
- n1 and n2 set n3 in SHUTDOWN status and force the heartbeat version to
INT_MAX
- n3 restarts
- n3 sends gossip shadow rounds to n1 and n2, in
storage_service::prepare_to_join,
- n3 receives response from n1, in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber1, filber 1 finishes faster filber 2, it
sets _in_shadow_round = false
- n3 receives response from n2, in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber2, filber 2 yields
- n3 finishes the shadow round and continues
- n3 resets gossip endpoint_state_map with
gossiper.reset_endpoint_state_map()
- n3 resumes fiber 2, apply application state about n3 into
endpoint_state_map, at this point endpoint_state_map contains
information including n3 itself from n2.
- n3 calls gossiper.start_gossiping(generation_number, app_states, ...)
with new generation number generated correctly in
storage_service::prepare_to_join, but in
maybe_initialize_local_state(generation_nbr), it will not set new
generation and heartbeat if the endpoint_state_map contains itself
- n3 continues with the old generation and heartbeat learned in fiber 2
- n3 continues the gossip loop, in gossiper::run,
hbs.update_heart_beat() the heartbeat is set to the number starting
from 0.
- n1 and n2 will not get update from n3 because they use the same
generation number but n1 and n2 has larger heartbeat version
- n1 and n2 will mark n3 as down even if n3 is alive.
To fix, always use the the new generation number.
Fixes: #5800
Backports: 3.0 3.1 3.2
(cherry picked from commit 62774ff882)
When qualifying columns to be fetched for filtering, we also check
if the target column is not used as an index - in which case there's
no need of fetching it. However, the check was incorrectly assuming
that any restriction is eligible for indexing, while it's currently
only true for EQ. The fix makes a more specific check and contains
many dynamic casts, but these will hopefully we gone once our
long planned "restrictions rewrite" is done.
This commit comes with a test.
Fixes#5708
Tests: unit(dev)
(cherry picked from commit 767ff59418)
SimpleStrategy creates a list of endpoints by iterating over the set of
all configured endpoints for the given token, until we reach keyspace
replication factor.
There is a trivial coding bug when we first add at least one endpoint
to the list, and then compare list size and replication factor.
If RF=0 this never yields true.
Fix by moving the RF check before at least one endpoint is added to the
list.
Cassandra never had this bug since it uses a less fancy while()
loop.
Fixes#5962
Message-Id: <20200306193729.130266-1-kostja@scylladb.com>
(cherry picked from commit ac6f64a885)
Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:
1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.
To fix, resize the _regions vector outside the lock.
Fixes#6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
(cherry picked from commit c020b4e5e2)
By default, `/usr/lib/rpm/find-debuginfo.sh` will temper with
the binary's build-id when stripping its debug info as it is passed
the `--build-id-seed <version>.<release>` option.
To prevent that we need to set the following macros as follows:
unset `_unique_build_ids`
set `_no_recompute_build_ids` to 1
Fixes#5881
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 25a763a187)
... when clustering key is unavailable' from Benny
This series fixes null pointer dereference seen in #5794efd7efe cql3: generate_base_key_from_index_pk; support optional index_ck
7af1f9e cql3: do_execute_base_query: generate open-ended slice when clustering key is unavailable
7fe1a9e cql3: do_execute_base_query: fixup indentation
Fixes#5794
Branches: 3.3
Test: unit(dev) secondary_indexes_test:TestSecondaryIndexes.test_truncate_base(debug)
* bhalevy/fix-5794-generate_base_key_from_index_pk:
cql3: do_execute_base_query: fixup indentation
cql3: do_execute_base_query: generate open-ended slice when clustering key is unavailable
cql3: generate_base_key_from_index_pk; support optional index_ck
(cherry picked from commit 4e95b67501)
It seems like *.service is conflicting on install time because the file
installed twice, both debian/*.service and debian/scylla-server.install.
We don't need to use *.install, so we can just drop the line.
Fixes#5640
(cherry picked from commit 29285b28e2)
Cells in CDC logs used to be created while completely neglecting
TTLs (the TTLs from `cdc = {...'ttl':600}`). This patch adds TTLs
to all cells; there are no row markers, so wee need not set TTL
there.
Fixes#5688
(cherry picked from commit 67b92c584f)
There may be other commitlog writes waiting for zeroing to complete, so
not using proper scheduling class causes priority inversion.
Fixes#5858.
Message-Id: <20200220102939.30769-2-gleb@scylladb.com>
(cherry picked from commit 6a78cc9e31)
The maybe_schedule_schema_pull waits for schema_tables_v3 to
become available. This is unsafe in case migration manager
goes away before the feature is enabled.
Fix this by subscribing on feature with feature::listener and
waiting for condition variable in maybe_schedule_schema_pull.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sleep is interrupted with the abort source, the "wait" part
is done with the existing _background_tasks gate. Also we need
to make sure the gate stays alive till the end of the function,
so make use of the async_sharded_service (migration manager is
already such).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This unregistration doesn't happen currently, but doesn't seem to
cause any problems in general, as on stop gossiper is stopped and
nothing from it hits the store_service.
However (!) if an exception pops up between the storage_service
is subscribed on gossiper and the drain_on_shutdown defer action
is set up then we _may_ get into the following situation:
- main's stuff gets unrolled back
- gossiper is not stopped (drain_on_shutdown defer is not set up)
- migration manager is stopped (with deferred action in main)
- a nitification comes from gossiper
-> storage_service::on_change might want to pull schema with
the help of local migration manager
-> assert(local_is_initialized) strikes
Fix this by registering storage_service to gossiper a bit earlier
(both are already initialized y that time) and setting up unregister
defer right afterwards.
Test: unit(dev), manual start-stop
Bug: #5628
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200130190343.25656-1-xemul@scylladb.com>
* seastar dd686552ff...f54084c08f (2):
> reactor: fallback to epoll backend when fs.aio-max-nr is too small
> util: move read_sys_file_as() from iotune to seastar header, rename read_first_line_as()
Fixes#5638.
This patch fixes a bug that appears because of an incorrect interaction
between counters and hinted handoff.
When a counter is updated on the leader, it sends mutations to other
replicas that contain all counter shards from the leader. If consistency
level is achieved but some replicas are unavailable, a hint with
mutation containing counter shards is stored.
When a hint's destination node is no longer its replica, it is attempted
to be sent to all its current replicas. Previously,
storage_proxy::mutate was used for that purpose. It was incorrect
because that function treats mutations for counter tables as mutations
containing only a delta (by how much to increase/decrease the counter).
These two types of mutations have different serialization format, so in
this case a "shards" mutation is reinterpreted as "delta" mutation,
which can cause data corruption to occur.
This patch backports `storage_proxy::mutate_hint_from_scratch`
function, which bypasses special handling of counter mutations and
treats them as regular mutations - which is the correct behavior for
"shards" mutations.
Refs #5833.
Backports: 3.1, 3.2, 3.3
Tests: unit(dev)
(cherry picked from commit ec513acc49)
The update generation path must track and apply all tombstones,
both from the existing base row (if read-before-write was needed)
and for the new row. One such path contained an error, because
it assumed that if the existing row is empty, then the update
can be simply generated from the new row. However, lack of the
existing row can also be the result of a partition/range tombstone.
If that's the case, it needs to be applied, because it's entirely
possible that this partition row also hides the new row.
Without taking the partition tombstone into account, creating
a future tombstone and inserting an out-of-order write before it
in the base table can result in ghost rows in the view table.
This patch comes with a test which was proven to fail before the
changes.
Branches 3.1,3.2,3.3
Fixes#5793
Tests: unit(dev)
Message-Id: <8d3b2abad31572668693ab585f37f4af5bb7577a.1581525398.git.sarna@scylladb.com>
(cherry picked from commit e93c54e837)
Before this patch the iterations over migration_notifier::_listeners
could race with listeners being added and removed.
The addition side is not modified, since it is common to add a
listener during construction and it would require a fairly big
refactoring. Instead, the iteration is modified to use indexes instead
of iterators so that it is still valid if another listener is added
concurrently.
For removal we use a rw lock, since removing an element invalidates
indexes too. There are only a few places that needed refactoring to
handle unregister_listener returning a future<>, so this is probably
OK.
Fixes#5541.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200120192819.136305-1-espindola@scylladb.com>
(cherry picked from commit 27bd3fe203)
After 546556b71b we can have mixed writes into commitlog,
some do flush immediately some do not. If non flushing write races with
flushing one and becomes responsible for writing back its buffer into a
file flush will be skipped which will cause assert in batch_cycle() to
trigger since flush position will not be advanced. Fix that by checking
that flush was skipped and in this case flush explicitly our file
position.
Fixes#5670
Message-Id: <20200128145103.GI26048@scylladb.com>
(cherry picked from commit c654ffe34b)
This patch affects the LWT queries with IF conditions of the
following form: `IF col in :value`, i.e. if the parameter
marker is used.
When executing a prepared query with a bound value
of `(None,)` (tuple with null, example for Python driver), it is
serialized not as NULL but as "empty" value (serialization
format differs in each case).
Therefore, Scylla deserializes the parameters in the request as
empty `data_value` instances, which are, in turn, translated
to non-empty `bytes_opt` with empty byte-string value later.
Account for this case too in the CAS condition evaluation code.
Example of a problem this patch aims to fix:
Suppose we have a table `tbl` with a boolean field `test` and
INSERT a row with NULL value for the `test` column.
Then the following update query fails to apply due to the
error in IF condition evaluation code (assume `v=(null)`):
`UPDATE tbl SET test=false WHERE key=0 IF test IN :v`
returns false in `[applied]` column, but is expected to succeed.
Tests: unit(debug, dev), dtest(prepared stmt LWT tests at https://github.com/scylladb/scylla-dtest/pull/1286)
Fixes: #5710
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200205102039.35851-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit bcc4647552)
"
client_state is used simultaneously by many requests running in parallel
while tracing state pointer is per request. Both those facts do not sit
well together and as a result sometimes tracing state is being overwritten
while still been used by active request which may cause incorrect trace
or even a crash.
"
Fixes#5700.
Backported from 9f1f60fc38
* 'gleb/trace_fix_3.3_backport' of ssh://github.com/scylladb/seastar-dev:
client_state: drop the pointer to a tracing state from client_state
transport: pass tracing state explicitly instead of relying on it been in the client_state
alternator: pass tracing state explicitly instead of relying on it been in the client_state
The table::flush_streaming_mutations is used in the days when streaming
data goes to memtable. After switching to the new streaming, data goes
to sstables directly in streaming, so the sstables generated in
table::flush_streaming_mutations will be empty.
It is unnecessary to invalidate the cache if no sstables are added. To
avoid unnecessary cache invalidating which pokes hole in the cache, skip
calling _cache.invalidate() if the sstables is empty.
The steps are:
- STREAM_MUTATION_DONE verb is sent when streaming is done with old or
new streaming
- table::flush_streaming_mutations is called in the verb handler
- cache is invalidated for the streaming ranges
In summary, this patch will avoid a lot of cache invalidation for
streaming.
Backports: 3.0 3.1 3.2
Fixes: #5769
(cherry picked from commit 5e9925b9f0)
This assert, added by 060e3f8 is supposed to make sure the invariant of
the append() is respected, in order to prevent building an invalid row.
The assert however proved to be too harsh, as it converts any bug
causing out-of-order clustering rows into cluster unavailability.
Downgrade it to on_internal_error(). This will still prevent corrupt
data from spreading in the cluster, without the unavailability caused by
the assert.
Fixes: #5786
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200211083829.915031-1-bdenes@scylladb.com>
(cherry picked from commit 3164456108)
client_state is shared between requests and tracing state is per
request. It is not safe to use the former as a container for the later
since a state can be overwritten prematurely by subsequent requests.
(cherry picked from commit 31cf2434d6)
Multiple requests can use the same client_state simultaneously, so it is
not safe to use it as a container for a tracing state which is per request.
Currently next request may overwrite tracing state for previous one
causing, in a best case, wrong trace to be taken or crash if overwritten
pointer is freed prematurely.
Fixes#5700
(cherry picked from commit 9f1f60fc38)
Multiple requests can use the same client_state simultaneously, so it is
not safe to use it as a container for a tracing state which is per
request. This is not yet an issue for the alternator since it creates
new client_state object for each request, but first of all it should not
and second trace state will be dropped from the client_state, by later
patch.
(cherry picked from commit 38fcab3db4)
Since dpkg does not re-install conffiles when it removed by user,
currently we are missing dependencies.conf and sysconfdir.conf on rollback.
To prevent this, we need to stop running
'rm -rf /etc/systemd/system/scylla-server.service.d/' on 'remove'.
Fixes#5734
(cherry picked from commit 43097854a5)
Change the SCYLLA_REPO_URL variable to point to branch-3.3 instead of
master. This ensures that Docker image builds that don't specify the
variable build from the right repository by default.
awk returns float value on Debian, it causes postinst script failure
since we compare it as integer value.
Replaced with sed + bash.
Fixes#5569
(cherry picked from commit 5627888b7c)
Treat writes to local.paxos as user memory, as the number of writes is
dependent on the amount of user data written with LWT.
Fixes#5682
Message-Id: <20200130150048.GW26048@scylladb.com>
(cherry picked from commit b08679e1d3)
We would sometimes produce an unnecessary extra 0xff prefix byte.
The new encoding matches what cassandra does.
This was both a efficiency and correctness issue, as using varint in a
key could produce different tokens.
Fixes#5656
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit c89c90d07f)
We use eventually() in tests to wait for eventually consistent data
to become consistent. However, we see spurious failures indicating
that we wait too little.
Increasing the timeout has a negative side effect in that tests that
fail will now take longer to do so. However, this negative side effect
is negligible to false-positive failures, since they throw away large
test efforts and sometimes require a person to investigate the problem,
only to conclude it is a false positive.
This patch therefore makes eventually() more patient, by a factor of
32.
Fixes#4707.
Message-Id: <20200130162745.45569-1-avi@scylladb.com>
(cherry picked from commit ec5b721db7)
We need to add '~' to handle rcX version correctly on Debian variants
(merged at ae33e9f), but when we moved to relocated package we mistakenly
dropped the code, so add the code again.
Fixes#5641
(cherry picked from commit dd81fd3454)
Docker restricts the number of processes in a container to some
limit it calculates. This limit turns out to be too low on large
machines, since we run multiple links in parallel, and each link
runs many threads.
Remove the limit by specifying --pids-limit -1. Since dbuild is
meant to provide a build environment, not a security barrier,
this is okay (the container is still restricted by host limits).
I checked that --pids-limit is supported by old versions of
docker and by podman.
Fixes#5651.
Message-Id: <20200127090807.3528561-1-avi@scylladb.com>
(cherry picked from commit 897320f6ab)
A mistake in handling legacy checks for special 'idx_token' column
resulted in not recognizing materialized views backing secondary
indexes properly. The mistake is really a typo, but with bad
consequences - instead of checking the view schema for being an index,
we asked for the base schema, which is definitely not an index of
itself.
Branches 3.1,3.2 (asap)
Fixes#5621Fixes#4744
(cherry picked from commit 9b379e3d63)
The atomic_cell pretty printers use a mix of commas and semicolons.
This change makes them use commas everywhere, for consistency.
Message-Id: <20200116133327.2610280-1-avi@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5567
from Calle Wilund:
Fixes#5314
Instead of tying CDC handling into cql statement objects, this patch set
moves it to storage proxy, i.e. shared code for mutating stuff. This means
we automatically handle cdc for code paths outside cql (i.e. alternator).
It also adds api handling (though initially inefficient) for batch statements.
CDC is tied into storage proxy by giving the former a ref to the latter (per
shard). Initially this is not a constructor parameter, because right now we
have chicken and egg issues here. Hopefully, Pavels refactoring of migration
manager and notifications will untie these and this relationship can become
nicer.
The actual augmentation can (as stated above) be made much more efficient.
Hopefully, the stream management refactoring will deal with expensive stream
lookup, and eventually, we can maybe coalesce pre-image selects for batches.
However, that is left as an exercise for when deemed needed.
The augmentation API has an optional return value for a "post-image handler"
to be used iff returned after mutation call is finished (and successful).
It is not yet actually invoked from storage_proxy, but it is at least in the
call chain.
The set make dependencies between mm and other services cleaner,
in particular, after the set:
- the query processor no longer needs migration manager
(which doesn't need query processor either)
- the database no longer needs migration manager, thus the mutual
dependency between these two is dropped, only migration manager
-> database is left
- the migration manager -> storage_service dependency is relaxed,
one more patchset will be needed to remove it, thus dropping one
more mutual dependency between them, only the storage_service
-> migration manager will be left
- the migration manager is stopped on drain, but several more
services need it on stop, thus causing use after free problems,
in particular there's a caught bug when view builder crashes
when unregistering from notifier list on stop. Fixed.
Tests: unit(dev)
Fixes: #5404
Enabling asan enables a few cleanup optimizations in gcc. The net
result is that using
-fsanitize=address -fno-sanitize-address-use-after-scope
Produces code that uses a lot less stack than if the file is compiled
with just -O0.
This patch adds -O1 in addition to
-fno-sanitize-address-use-after-scope to protect the unfortunate
developer that decides to build in dev mode with --cflags='-O0 -g'.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200116012318.361732-2-espindola@scylladb.com>
It is sometimes convenient to build with flags that don't match any
existing mode.
Recently I was tracking a bug that would not reproduce with debug, but
reproduced with dev, so I tried debugging the result of
./configure.py --cflags="-O0 -g"
While the binary had debug info, it still had optimizations because
configure.py put the mode flags after the user flags (-O0 -O1). This
patch flips the order (-O1 -O0) so that the flags passed in the
command line win.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200116012318.361732-1-espindola@scylladb.com>
CQL transport code relies on an exception's C++ type to create correct
reply, but in lwt we converted some mutation_timeout exceptions to more
generic request_timeout while forwarding them which broke the protocol.
Do not drop type information.
Fixes#5598.
Message-Id: <20200115180313.GQ9084@scylladb.com>
Murmur3 is the default partitioner.
ByteOrder and Random are the deprecated ones
and should be mentioned in the description.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5294 from
Amnon Heiman:
To use a snapshot we need a schema file that is similar to the result of
running cql DESCRIBE command.
The DESCRIBE is implemented in the cql driver so the functionality needs
to be re-implemented inside scylla.
This series adds a describe method to the schema file and use it when doing
a snapshot.
There are different approach of how to handle materialize views and
secondary indexes.
This implementation creates each schema.cql file in its own relevant
directory, so the schema for materializing view, for example, will be
placed in the snapshot directory of the table of that view.
Fixes#4192
This commit makes most sleeps in gossip.cc abortable. It is now possible
to quickly shut down a node during startup, most notably during the
phase while it waits for gossip to settle.
This reduces network traffic and eliminates time for installation when
building packages from the frozen toolchain, as well as isolating the
build from updates to those dependencies which may cause breakage.
This patch set adds support for CQL tests to test.py,
as well as many other improvements:
* --name is now a positional argument
* test output is preserved in testlog/${mode}
* concise output format
* better color support
* arbitrary number of test suites
* per-suite yaml-based configuration
* options --jenkins and --xunit are removed and xml
files are generated for all runs
A simple driver is written in C++ to read CQL for
standard input, execute in embedded mode and produce output.
The patch is checked with BYO.
Reviewed-by: Dejan Mircevski <dejan@scylladb.com>
* 'test.py' of github.com:/scylladb/scylla-dev: (39 commits)
test.py: introduce BoostTest and virtualize custom boost arguments
test.py: sort tests within a suite, and sort suites
test.py: add a basic CQL test
test.py: add CQL .reject files to gitignore
test.py: print a colored unidiff in case of test failure
test.py: add CqlTestSuite to run CQL tests
test.py: initial import of CQL test driver, cql_repl
test.py: remove custom colors and define a color palette
test.py: split test output per test mode
test.py: remove tests_to_run
test.py: virtualize Test.run(), to introduce CqlTest.Run next
test.py: virtualize test search pattern per TestSuite
test.py: virtualize write_xunit_report()
test.py: ensure print_summary() is agnostic of test type
test.py: tidy up print_summary()
test.py: introduce base class Test for CQL and Unit tests
test.py: move the default arguments handling to UnitTestSuite
test.py: move custom unit test command line arguments to suite.yaml
test.py: move command line argument processing to UnitTestSuite
test.py: introduce add_test(), which is suite-specific
...
The Ubuntu-based Docker image uses Scylla 1.0 and has not been updated
since 2017. Let's remove it as unmaintained.
Message-Id: <20200115102405.23567-1-penberg@scylladb.com>
"
Currently commitlog supports two modes of operation. First is 'periodic'
mode where all commitlog writes are ready the moment they are stored in
a memory buffer and the memory buffer is flushed to a storage periodically.
Second is a 'batch' mode where each write is flushed as soon as possible
(after previous flush completed) and writes are only ready after they
are flushed.
The first option is not very durable, the second is not very efficient.
This series adds an option to mark some writes as "more durable" in
periodic mode meaning that they will be flushed immediately and reported
complete only after the flush is complete (flushing a durable write also
flushes all writes that came before it). It also changes paxos to use
those durable writes to store paxos state.
Note that strictly speaking the last patch is not needed since after
writing to an actual table the code updates paxos table and the later
uses durable writes that make sure all previous writes are flushed. Given
that both writes supposed to run on the same shard this should be enough.
But it feels right to make base table writes durable as well.
"
* 'gleb/commilog_sync_v4' of github.com:scylladb/seastar-dev:
paxos: immediately sync commitlog entries for writes made by paxos learn stage
paxos: mark paxos table schema as "always sync"
schema: allow schema to be marked as 'always sync to commitlog'
commitlog: add test for per entry sync mode
database: pass sync flag from db::apply function to the commitlog
commitlog: add sync method to entry_writer
Before this patch result_set_assertions was handling both null values
and missing values in the same way.
This patch changes the handling of missing values so that now checking
for a null value is not the same as checking for a value not being
present.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200114184116.75546-1-espindola@scylladb.com>
3ec889816 changed cell::make_collection() to take different code paths
depending whether its `data` argument is nothrow copyable/movable or
not. In case it is not, it is wrapped in a view to make it so (see the
above mentioned commit for a full explanation), relying on the methods
pre-existing requirement for callers to keep `data` alive while the
created writer is in use.
On closer look however it turns out that this requirement is neither
respected, nor enforced, at least not on the code level. The real
requirement is that the underlying data represented by `data` is kept
alive. If `data` is a view, it is not expected to be kept alive and
callers don't, it is instead copied into `make_collection()`.
Non-views however *are* expected to be kept alive. This makes the API
error prone.
To avoid any future errors due to this ambiguity, require all `data`
arguments to be nothrow copyable and movable. Callers are now required
to pass views of nonconforming objects.
This patch is a usability improvement and is not fixing a bug. The
current code works as-is because it happens to conform to the underlying
requirements.
Refs: #5575
Refs: #5341
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200115084520.206947-1-bdenes@scylladb.com>
This patch adds tests for the describe method.
test_describe_simple_schema tests regular tables.
test_describe_view_schema tests view and index.
Each test, create a table, find the schema, call the describe method and
compare the results to the string that was used to create the table.
The view tests also verify that adding an index or view does not change
the base table.
When comparing results, leading and trailing white spaces are ignored
and all combination of whitespaces and new lines are treated equaly.
Additional tests may be added at a future phase if required.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When creating a snapshot we need to add a schema.cql file in the
snapshot directory that describes the table in that snapshot.
This patch adds the file using the schema describe method.
get_snapshot_details and manifest_json_filter were modified to ignore
the schema.cql file.
Fixes#4192
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a describe method to a table schema.
It acts similar to a DESCRIBE cql command that is implemented in a CQL
driver.
The method supports tables, secondary indexes local indexes and
materialize views.
relates to: #4192
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
index_name_from_table_name is a reverse of index_table_name,
it gets a table name that was generated for an index and return the name
of the index that generated that table.
Relates to #4192
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The factory is purely a state-less thing, there is no difference what
instance of it to use, so we may omit referencing the storage_service
in passive_announce
This is 2nd simple migration_manager -> storage_service link to cut
(more to come later).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places where migration_manager needs storage_service
reference to get the database from, thus forming the mutual dependency
between them. This is the simplest case where the migration_manager
link to the storage_service can be cut -- the databse reference can be
obtained from storage_proxy instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the last place where database code needs the migration_manager
instance to be alive, so now the mutual dependency between these two
is gone, only the migration_manager needs the database, but not the
vice-versa.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_server needs migration_manager for notifications and
carefully handles the manager's stop process not to demolish the
listeners list from under itself. From now on this dependency is
no longer valid (however the storage_service seems still need the
migration_manager, but this is different story).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch removes an implicit cql_server -> migration_manager
dependency, as the former's event notifier uses the latter
for notifications.
This dependency also breaks a loop:
storage_service -> cql_server -> migration_manager -> storage_service
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch breaks one (probably harmless but still) dependency
loop. The query_processor -> migration_manager -> storage_proxy
-> tracing -> query_processor.
The first link is not not needed, as the query_processor needs the
migration_manager purely to (ub)subscribe on notifications.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The same as with view builder. The constructor still needs both,
but the life-time reference is now for notifier only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The migration manager itself is still needed on start to wait
for schema agreement, but there's no longer the need for the
life-time reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Do not call for local migration manager instance to send notifications,
call for the local migration notifier, it will always be alive.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service will need this guy to initialize sub-services
with. Also it registers itself with notifiers.
That said, it's convenient to have the migration notifier on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The _listeners list on migration_manager class and the corresponding
notify_xxx helpers have nothing to do with the its instances, they
are just transport for notification delivery.
At the same time some services need the migration manager to be alive
at their stop time to unregister from it, while the manager itself
may need them for its needs.
The proposal is to move the migration notifier into a complete separate
sharded "service". This service doesn't need anything, so it's started
first and stopped last.
While it's not effectively a "migration" notifier, we inherited the name
from Cassandra and renaming it will "scramble neurons in the old-timers'
brains but will make it easier for newcomers" as Avi says.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The comparator is refreshed to ensure the following:
- null compares less to all other types;
- null, true and false are comparable against each other,
while other types are only comparable against themselves and null.
Comparing mixed types is not currently reachable from the alternator
API, because it's only used for sets, which can only use
strings, binary blobs and numbers - thus, no new pytest cases are added.
Fixes#5454
When counter mutation is about to be sent, a leader is elected, but
if the leader fails after election, we get `rpc::closed_error`. The
exception propagates high up, causing all connections to be dropped.
This patch intercepts `rpc::closed_error` in `storage_proxy::mutate_counters`
and translates it to `mutation_write_failure_exception`.
References #2859
Run the test and compare results. Manage temporary
and .reject files.
Now that there are CQL tests, improve logging.
run_test success no longer means test success.
cql_repl is a simple program which reads CQL from stdin,
executes it, and writes results to stdout.
It support --input, --output and --log options.
--log is directed to cql_test.log by default.
--input is stdin by default
--output is stdout by default.
The result set output is print with a basic
JSON visitor.
Store test temporary files and logs in ${testdir}/${mode}.
Remove --jenkins and --xunit, and always write XML
files at a predefined location: ${testdir}/${mode}/xml/.
Use .xunit.xml extension for tests which XML output is
in xunit format, and junit.xml for an accumulated output
of all non-boost tests in junit format.
Load the command line arguments, if any, from suite.yaml, rather
than keep them hard-coded in test.py.
This is allows operations team to have easier access to these.
Note I had to sacrifice dynamic smp count for mutation_reader_test
(the new smp count is fixed at 3) since this is part
of test configuration now.
This way we can avoid iterating over all tests
to handle --repeat.
Besides, going forward the tests will be stored
in two places: in the global list of all tests,
for the runner, and per suite, for suite-based
reporting, so it's easier if TestSuite
if fully responsible for finding and adding tests.
Scan entire test/ for folders that contain suite.yaml,
and load tests from these folders. Skip the rest.
Each folder with a suite.yaml is expected to have a valid
suite configuration in the yaml file.
A suite is a folder with test of the same type. E.g.
it can be a folder with unit tests, boost tests, or CQL
tests.
The harness will use suite.yaml to create an appropriate
suite test driver, to execute tests in different formats.
It reduces the number of configurations to re-test when test.py is
modified. and simplifies usage of test.py in build tools, since you no
longer need to bother with extra arguments.
Going forward I'd like to make terminal output brief&tabular,
but some test details are necessary to preserve so that a failure
is easy to debug. This information now goes to the log file.
- open and truncate the log file on each harness start
- log options of each invoked test in the log, so that
a failure is easy to reproduce
- log test result in the log
Since tests are run concurrently, having an exact
trace of concurrent execution also helps
debugging flaky tests.
The storage_service struct is a collection of diverse things,
most of them requiring only on start and on stop and/or runing
on shard 0 (but is nonetheless sharded).
As a part of clearing this structure and generated by it inter-
-componenes dependencies, here's the sanitation of load_broadcaster.
Fixes#5582
... but only populate log on shard 0.
Migration manager callbacks are slightly assymetric. Notifications
for pre-create/update mutations are sent only on initiating shard
(neccesary, because we consider the mutations mutable).
But "created" callbacks are sent on all shards (immutable).
We must subscribe on all shards, but still do population of cdc table
only once, otherwise we can either miss table creat or populate
more than once.
v2:
- Add test case
Message-Id: <20200113140524.14890-1-calle@scylladb.com>
* seastar 36cf5c5ff0...3f3e117de3 (16):
> memcached: don't use C++17-only std::optional
> reactor: Comment why _backend is assigned in constructor body
> log: restore --log-to-stdout for backward compatibility
> used_size.hh: Include missing headers
> core: Move some code from reactor.cc to future.cc
> future-util: move parallel_for_each to future-util.cc
> task: stop wrapping tasks with unique_ptr
> Merge "Setup timer signal handler in backend constructor" from Pavel
Fixes#5524
> future: avoid a branch in future's move constructor if type is trivial
> utils: Expose used_size
> stream: Call get_future early
> future-util: Move parallel_for_each_state code to a .cc
> memcached: log exceptions
> stream: Delete dead code
> core: Turn pollable_fd into a simple proxy over pollable_fd_state.
> Merge "log to std::cerr" from Benny
This is the part of de-bloating storage_service.
The field in question is used to temporary keep the _token_metadata
value during shard-wide replication. There's no need to have it as
class member, any "local" copy is enough.
Also, as the size of token_metadata is huge, and invoke_on_all()
copies the function for each shard, keep one local copy of metadata
using do_with() and pass it into the invoke_on_all() by reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reviewed-by: Asias He <asias@scylladb.com>
Message-Id: <20200113171657.10246-1-xemul@scylladb.com>
The query option always_return_static_content was added for lightweight
transations in commits e0b31dd273 (infrastructure) and 65b86d155e
(actual use). However, the flag was added unconditionally to
update_parameters::options. This caused it to be set for list
read-modify-write operations, not just for lightweight transactions.
This is a little wasteful, and worse, it breaks compatibility as old
nodes do not understand the always_return_static_content flag and
complain when they see it.
To fix, remove the always_return_static_content from
update_parameters::options and only set it from compare-and-swap
operations that are used to implement lightweight transactions.
Fixes#5593.
Reviewed-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20200114135133.2338238-1-avi@scylladb.com>
The drain_in_progress variable here is the future that's set by the
drain() operation itself. Its promise is set when the drain() finishes.
The check for this future in the beginning of drain() is pointless.
No two drain()-s can run in parallels because of run_with_api_lock()
protection. Doing the 2nd drain after successfull 1st one is also
impossible due to the _operation_mode check. The 2nd drain after
_exceptioned_ (and thus incomplete) 1st one will deadlock, after
this patch will try to drain for the 2nd time, but that should by ok.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200114094724.23876-1-xemul@scylladb.com>
This change introduces system.clients table, which provides
information about CQL clients connected.
PK is the client's IP address, CK consists of outgoing port number
and client_type (which will be extended in future to thrift/alternator/redis).
Table supplies also shard_id and username. Other columns,
like connection_stage, driver_name, driver_version...,
are currently empty but exist for C* compatibility and future use.
This is an ordinary table (i.e. non-virtual) and it's updated upon
accepting connections. This is also why C*'s column request_count
was not introduced. In case of abrupt DB stop, the table should not persist,
so it's being truncated on startup.
Resolves#4820
"
Most of the code in `cell` and the `imr` infrastructure it is built on
is `noexcept`. This means that extra care must be taken to avoid rouge
exceptions as they will bring down the node. The changes introduced by
0a453e5d3a did just that - introduced rouge `std::bad_alloc` into this
code path by violating an undocumented and unvalidated assumption --
that fragment ranges passed to `cell::make_collection()` are nothrow
copyable and movable.
This series refactors `cell::make_collection()` such that it does not
have this assumption anymore and is safe to use with any range.
Note that the unit test included in this series, that was used to find
all the possible exception sources will not be currently run in any of
our build modes, due to `SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION` not
being set. I plan to address this in a followup because setting this
flags fails other tests using the failure injection mechanism. This is
because these tests are normally run with the failure injection disabled
so failures managed to lurk in without anyone noticing.
Fixes: #5575
Refs: #5341
Tests: unit(dev, debug)
"
* 'data-cell-make-collection-exception-safety/v2' of https://github.com/denesb/scylla:
test: mutation_test: add exception safety test for large collection serialization
data/cell.hh: avoid accidental copies of non-nothrow copiable ranges
utils/fragment_range.hh: introduce fragment_range_view
We do not yet support the ScanIndexForward=false option for reversing
the sort order of a Query operation, as reported in issue #5153.
But even before implementing this feature, it is important that we
produce an error if a user attempts to use it - instead of outright
ignoring this parameter and giving the user wrong results. This is
what this patch does.
Before this patch, the reverse-order query in the xfailing test
test_query.py::test_query_reverse seems to succeed - yet gives
results in the wrong order. With this patch, the query itself fails -
stating that the ScanIndexForward=false argument is not supported.
Refs #5153
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200105113719.26326-1-nyh@scylladb.com>
Here's another theoretical problem, that involves 3 sequential calls
to respectively removenode, force_removenode and some other operation.
Let's walk through them
First goes the removenode:
run_with_api_lock
_operation_in_progress = "removenode"
storage_service::remove_node
sleep in replicating_nodes.empty() loop
Now the force_removenode can run:
run_with_no_api_lock
storage_service::force_removenode
check _operation_in_progress (not empty)
_force_remove_completion = true
sleep in _operation_in_progress.empty loop
Now the 1st call wakes up and:
if _force_remove_completion == true
throw <some exception>
.finally() handler in run_with_api_lock
_operation_in_progress = <empty>
At this point some other operation may start. Say, drain:
run_with_api_lock
_operation_in_progress = "drain"
storage_service::drain
...
go to sleep somewhere
No let's go back to the 1st op that wakes up from its sleep.
The code it executes is
while (!ss._operation_in_progress.empty()) {
sleep_abortable()
}
and while the drain is running it will never exit.
However (! and this is the core of the race) should the drain
operation happen _before_ the force_removenode, another check
for _operation_in_progress would have made the latter exit with
the "Operation drain is in progress, try again" message.
Fix this inconsistency by making the check for current operation
every wake-up from the sleep_abortable.
Fixes#5591
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Here's a theoretical problem, that involves 3 sequential calls
to respectively removenode, force_removenode and removenode (again)
operations. Let's walk through them
First goes the removenode:
run_with_api_lock
_operation_in_progress = "removenode"
storage_service::remove_node
sleep in replicating_nodes.empty() loop
Now the force_removenode can run:
run_with_no_api_lock
storage_service::force_removenode
check _operation_in_progress (not empty)
_force_remove_completion = true
sleep in _operation_in_progress.empty loop
Now the 1st call wakes up and:
if _force_remove_completion == true
_force_remove_completion = false
throw <some exception>
.finally() handler in run_with_api_lock
_operation_in_progress = <empty>
! at this point we have _force_remove_completion = false and
_operation_in_progress = <empty>, which opens the following
opportunity for the 3d removenode:
run_with_api_lock
_operation_in_progress = "removenode"
storage_service::remove_node
sleep in replicating_nodes.empty() loop
Now here's what we have in 2nd and 3rd ops:
1. _operation_in_progress = "removenode" (set by 3rd) prevents the
force_removenode from exiting its loop
2. _force_remove_completion = false (set by 1st on exit) prevents
the removenode from waiting on replicating_nodes list
One can start the 4th call with force_removenode, it will proceed and
wake up the 3rd op, but after it we'll have two force_removenode-s
running in parallel and killing each other.
I propose not to set _force_remove_completion to false in removenode,
but just exit and let the owner of this flag unset it once it gets
the control back.
Fixes#5590
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Other types do not have a wider accumulator at the moment.
And static_cast<accumulator_type>(ret) != _sum evaluates as
false for NaN/Inf floating point values.
Fixes#5586
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200112183436.77951-1-bhalevy@scylladb.com>
Now that atomic_cell_view and collection_mutation_view have
type-aware printers, we can use them in the type-aware atomic_cell_or_collection
printer.
Message-Id: <20191231142832.594960-1-avi@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5533
from Avi Kivity:
canonical_mutation objects are used for schema reconciliation, which is a
fragile area and thus deserves some debugging help.
This series makes canonical_mutation objects printable.
Merged patch series from Piotr Sarna:
"Previous assumption was that there can only be one regular base column
in the view key. The assumption is still correct for tables created
via CQL, but it's internally possible to create a view with multiple
such columns - the new assumption is that if there are multiple columns,
they share their liveness.
This series is vital for indexing to work properly on alternator,
so it would be best to solve the issue upstream. I strived to leave
the existing semantics intact as long as only up to one regular
column is part of the materialized view primary key, which is the case
for Scylla's materialized views. For alternator it may not be true,
but all regular columns in alternator share liveness info (since
alternator does not support per-column TTL), which is sufficient
to compute view updates in a consistent way.
Fixes#5006
Tests: unit(dev), alternator(test_gsi_update_second_regular_base_column, tic-tac-toe demo)"
Piotr Sarna (3):
db,view: fix checking if partition key is empty
view: handle multiple regular base columns in view pk
test: add a case for multiple base regular columns in view key
alternator-test/test_gsi.py | 1 -
view_info.hh | 5 +-
cql3/statements/alter_table_statement.cc | 2 +-
db/view/view.cc | 77 ++++++++++++++----------
mutation_partition.cc | 2 +-
test/boost/cql_query_test.cc | 58 ++++++++++++++++++
6 files changed, 109 insertions(+), 36 deletions(-)
Merged patch series from Gleb Natapov:
"LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by the transport
code that jumps to a correct shard and re-process incoming message there.
The nicer way to achieve the same would be to jump to a right shard
inside of the storage_proxy::cas(), but unfortunately with current
implementation of the modification statements they are unusable by
a shard different from where it was created, so the jump should happen
before a modification statement for an cas() is created. When we fix our
cql code to be more cross-shard friendly this can be reworked to do the
jump in the storage_proxy."
Gleb Natapov (4):
transport: change make_result to takes a reference to cql result
instead of shared_ptr
storage_service: move start_native_transport into a thread
lwt: Process lwt request on a owning shard
lwt: drop invoke_on in paxos_state prepare and accept
auth/service.hh | 5 +-
message/messaging_service.hh | 2 +-
service/client_state.hh | 30 +++-
service/paxos/paxos_state.hh | 10 +-
service/query_state.hh | 6 +
service/storage_proxy.hh | 2 +
transport/messages/result_message.hh | 20 +++
transport/messages/result_message_base.hh | 4 +
transport/request.hh | 4 +
transport/server.hh | 25 ++-
cql3/statements/batch_statement.cc | 6 +
cql3/statements/modification_statement.cc | 6 +
cql3/statements/select_statement.cc | 8 +
message/messaging_service.cc | 2 +-
service/paxos/paxos_state.cc | 48 ++---
service/storage_proxy.cc | 47 ++++-
service/storage_service.cc | 120 +++++++------
test/boost/cql_query_test.cc | 1 +
thrift/handler.cc | 3 +
transport/messages/result_message.cc | 5 +
transport/server.cc | 203 ++++++++++++++++------
21 files changed, 377 insertions(+), 180 deletions(-)
Use `seastar::memory::local_failure_injector()` to inject al possible
`std::bad_alloc`:s into the collection serialization code path. The test
just checks that there are no `std::abort()`:s caused by any of the
exceptions.
The test will not be run if `SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION` is
not defined.
`cell::make_collection()` assumes that all ranges passed to it are
nothrow copyable and movable views. This is not guaranteed, is not
expressed in the interface and is not mentioned in the comments either.
The changes introduced by 0a453e5d3a to collection serialization, making
it use fragmented buffers, fell into this trap, as it passes
`bytes_ostream` to `cell::make_collection()`. `bytes_ostream`'s copy
constructor allocates and hence can throw, triggering an
`std::terminate()` inside `cell::make_collection()` as the latter is
noexcept.
To solve this issue, non-nothrow copyable and movable ranges are now
wrapped in a `fragment_range_view` to make them so.
`cell::make_collection()` already requires callers to keep alive the
range for the duration of the call, so this does not introduce any new
requirements to the callers. Additionally, to avoid any future
accidents, do not accept temporaries for the `data` parameter. We don't
ever want to move this param anyway, we will either have a trivially
copyable view, or a potentially heavy-weight range that we will create a
trivially copyable view of.
A lightweight, trivially copyable and movable view for fragment ranges.
Allows for uniform treatment of all kinds of ranges, i.e. treating all
of them as a view. Currently `fragment_range.hh` provides lightweight,
view-like adaptors for empty and single-fragment ranges (`bytes_view`). To
allow code to treat owning multi-fragment ranges the shame way as the
former two, we need a view for the latter as well -- this is
`fragment_range_view`.
Resolves#4820. Execution path in main.cc now cleans up system.clients
table if it exists (this is done on startup). Also, server.cc now calls
functions that notify about cql clients connecting/disconnecting.
This simplifies the storage_service API and fixes the
complain about shared_ptr usage instead of unique_ptr.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a lonely get_load_map() call on storage_service that
needs only load broadcaster, always runs on shard 0 and that's it.
Next patch will move this whole stuff into its own helper no-shard
container and this is preparation for this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since lwt requests are now running on an owning shard there is no longer
a need to invoke cross shard call on paxos_state level. RPC calls may
still arrive to a wrong shard so we need to make cross shard call there.
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by transport code
that jumps to a correct shard and re-process incoming message there.
"
The original fix (10f6b125c8) didn't
take into account that if there was a failed memtable flush (Refs
flush) but is not a flushable memtable because it's not the latest in
the memtable list. If that happens, it means no other memtable is
flushable as well, cause otherwise it would be picked due to
evictable_occupancy(). Therefore the right action is to not flush
anything in this case.
Suspected to be observed in #4982. I didn't manage to reproduce after
triggering a failed memtable flush.
Fixes#3717
"
* tag 'avoid-ooming-with-flush-continuations-v2' of github.com:tgrabiec/scylla:
database: Avoid OOMing with flush continuations after failed memtable flush
lsa: Introduce operator bool() to occupancy_stats
lsa: Expose region_impl::evictable_occupancy in the region class
"
Fix overflow handling in sum() and avg().
sum:
- aggregated into __int128
- detect overflow when computing result and log a warning if found
avg:
- fix division function to divide the accumulator type _sum (__int128 for integers) by _count
Add unit tests for both cases
Test:
- manual test against Cassandra 3.11.3 to make sure the results in the scylla unit test agree with it.
- unit(dev), cql_query_test(debug)
Fixes#5536
"
* 'cql3-sum-overflow' of https://github.com/bhalevy/scylla:
test: cql_query_test: test avg overflow
cql3: functions: protect against int overflow in avg
test: cql_query_test: test sum overflow
cql3: functions: detect and handle int overflow in sum
exceptions: sort exception_code definitions
exceptions: define additional cassandra CQL exceptions codes
"
We were failing to start a thread when the UDF call was nested in an
aggregate function call like SUM.
"
* 'espindola/fix-sum-of-udf' of https://github.com/espindola/scylla:
cql3: Fix indentation
cql3: Add missing with_thread_if_needed call
cql3: Implement abstract_function_selector::requires_thread
remove make_ready_future call
This was initialized to api::missing_timestamp but
should be set to either a client provided-timestamp or
the server's.
Unlike write operations, this timestamp need not be unique
as the one generated by client_state::get_timestamp.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200108074021.282339-2-bhalevy@scylladb.com>
exec->_cmd->read_timestamp may be initialized by default to api::min_timestamp,
causing:
service/storage_proxy.cc:3328:116: runtime error: signed integer overflow: 1577983890961976 - -9223372036854775808 cannot be represented in type 'long int'
Aborting on shard 1.
Do not optimize cross-dc repair if read_timestamp is missing (or just negative)
We're interested in reads that happen within write_timeout of a write.
Fixes#5556
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200108074021.282339-1-bhalevy@scylladb.com>
The cdc service is assigned from outside, post construction, mainly
because of the chickens and eggs in main startup. Would be nice to
have it unconditionally, but this is workable.
It is (and shall) only be called from inside storage proxy,
and we would like this to be reflected in the interface
so our eventual moving of cdc logic into the mutate call
chains become easier to verify and comprehend.
To eventually replace the free function.
Main difference is this is build to both handle batches correctly
and to eventually allow hanging cdc object on storage proxy,
and caches on the cdc object.
The test case checks that having two base regular columns
in the materialized view key (not obtainable via CQL),
still works fine when values are inserted or deleted.
If TTL was involved and these columns would have different expiration
rules, the case would be more complicated, but it's not possible
for a user to reach that case - neither with CQL, nor with alternator.
Previous assumption was that there can only be one regular base column
in the view key. The assumption is still correct for tables created
via CQL, but it's internally possible to create a view with multiple
such columns - the new assumption is that if there are multiple columns,
they share their liveness.
This patch is vital for indexing to work properly on alternator,
so it would be best to solve the issue upstream. I strived to leave
the existing semantics intact as long as only up to one regular
column is part of the materialized view primary key, which is the case
for Scylla's materialized views. For alternator it may not be true,
but all regular columns in alternator share liveness info (since
alternator does not support per-column TTL), which is sufficient
to compute view updates in a consistent way.
Fixes#5006
Tests: unit(dev), alternator(test_gsi_update_second_regular_base_column, tic-tac-toe demo)
Message-Id: <c9dec243ce903d3a922ce077dc274f988bcf5d57.1567604945.git.sarna@scylladb.com>
Now that position_in_partition_view has type-aware printing, use it
to provide a human readable version of clustering keys.
Message-Id: <20191231151315.602559-2-avi@scylladb.com>
If the position_in_partition_view represents a clustering key,
we can now see it with the clustering key decoded according to
the schema.
Message-Id: <20191231151315.602559-1-avi@scylladb.com>
Previous implementation did not take into account that a column
in a partition key might exist in a mutation, but in a DEAD state
- if it's deleted. There are no regressions for CQL, while for
alternator and its capability of having two regular base columns
in a view key, this additional check must be performed.
This reduces code bloat and makes the code friendlier for IDEs, as the
IDE now understands the type of create_schema.
Message-Id: <20191231134803.591190-1-avi@scylladb.com>
sstables::write_simple() has quite a lot of boilerplate
which gets replicated into each template instance. Move
all of that into a non-template do_write_simple(), leaving
only things that truly depend on the component being written
in the template, and encapsulating them with a
noncopyable_function.
An explicit template instantiation was added, since this
is used in a header file. Before, it likely worked by
accident and stopped working when the template became
small enough to inline.
Tests: unit (dev)
Message-Id: <20200106135453.1634311-1-avi@scylladb.com>
mutation_partition_view now supports a compile-time resolved visitor.
This is performant but results in bloat when the performance is not
needed. Furthermore, the template function that applies the object
to the visitor is private and out-of-line, to reduce compile time.
To allow visitation on mutation_partition_view objects, add a virtual
visitor type and a non-template accept function.
Note: mutation_partition_visitor is very similar to the new type,
but different enough to break the template visitor which is used
to implement the new visitor.
The new visitor will be used to implement pretty printing for
canonical_mutation.
Consider this:
1) Write partition_start of p1
2) Write clustering_row of p1
3) Write partition_end of p1
4) Repair is stopped due to error before writing partition_start of p2
5) Repair calls repair_row_level_stop() to tear down which calls
wait_for_writer_done(). A duplicate partition_end is written.
To fix, track the partition_start and partition_end written, avoid
unpaired writes.
Backports: 3.1 and 3.2
Fixes: #5527
commit 21dec3881c introduced
a bug that will cause scylla debian build to fail. This is
because the commit relied on the environment PRODUCT variable
to be exported (and as a result, to propogate to the rename
command that is executed by find in a subshell)
This commit fixes it by explicitly passing the PRODUCT variable
into the rename command.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20200106102229.24769-1-eliransin@scylladb.com>
In scylla, the replacing node is set as HIBERNATE status. It is the only
place we use HIBERNATE status. The replacing node is supposed to be
alive and updating its heartbeat, so it is not supposed to be in dead
state.
This patch fixes the following problem in replacing.
1) start n1, n2
2) n2 is down
3) start n3 to replace n2, but kill n3 in the middle of the replace
4) start n4 to replace n2
After step 3 and step 4, the old n3 will stay in gossip forever until a
full cluster shutdown. Note n3 will only stay in gossip but in
system.peers table. User will see the annoying and infinite logs like on
all the nodes
rpc - client $ip_of_n3:7000: fail to connect: Connection refused
Fixes: #5449
Tests: replace_address_test.py + manual test
VERSION_ID of centos7 is "7", but VERSION_ID of oel7.7 is "7.7"
scylla_ntp_setup doesn't work on OEL7.7 for ValueError.
- ValueError: invalid literal for int() with base 10: '7.7'
This patch changed redhat_version() to return version string, and compare
with parse_version().
Fixes#5433
Signed-off-by: Amos Kong <amos@scylladb.com>
When the progress is queried, e.g., query from nodetool netstats
the progress info might not be updated yet.
Fix it by checking before access the map to avoid errors like:
std::out_of_range (_Map_base::at)
Fixes: #5437
Tests: nodetool_additional_test.py:TestNodetool.netstats_test
This depends on the just emailed fixes to undefined behavior in
tests. With this change we should quickly notice if a change
introduces undefined behavior.
Fixes#4054
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191230222646.89628-1-espindola@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5538 from
Avi Kivity and Piotr Jastrzębski.
This series prepares CDC for rolling upgrade. This consists of
reducing the footprint of cdc, when disabled, on the schema, adding
a cluster feature, and redacting the cdc column when transferring
it to other nodes. The latter is needed because we'll want to backport
this to 3.2, which doesn't have canonical_mutations yet.
If in memory buffer has not enough space for incoming mutation it is
written into a file, but the code missed updating timestamp of a last
sync, so we may sync to often.
Message-Id: <20200102155049.21291-9-gleb@scylladb.com>
The code that enters the gate never defers before leaving, so the gate
behaves like a flag. Lets use existing flag to prohibit adding data to a
closed segment.
Message-Id: <20200102155049.21291-8-gleb@scylladb.com>
Currently segment closing code is spread over several functions and
activated based on the _closed flag. Make segment closing explicit
by moving all the code into close() function and call it where _closed
flag is set.
Message-Id: <20200102155049.21291-6-gleb@scylladb.com>
Currently sync() does two completely different things based on the
shutdown parameter. Separate code into two different function.
Message-Id: <20200102155049.21291-3-gleb@scylladb.com>
The original "test_schema_digest_does_not_change" test case ensures
that schema digests will match for older nodes that do not support
all the features yet (including computed columns).
The additional case uses sstables generated after CDC was enabled
and a table with CDC enabled is created,
in order to make sure that the digest computed
including CDC column does not change spuriously as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Addition of cdc column in scylla_tables changes how schema
digests are calculated, and affect the ABI of schema update
messages (adding a column changes other columns' indexes
in frozen_mutation).
To fix this, extend the schema_tables mechanism with support
for the cdc column, and adjust schemas and mutations to remove
that column when sending schemas during upgrade.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
An empty cdc column in scylla_tables is hashed differently from
a missing column. This causes schema mismatch when a schema is
propagated to another node, because the other node will redact
the schema column completely if the cluster feature isn't enabled,
and an empty value is hashed differently from a missing value.
Store a tombstone instead. Tombstones are removed before
digesting, so they don't affect the outcome.
This change also undoes the changes in 386221da84 ("schema_tables:
handle 'cdc' options") to schema_change_test
test_merging_does_not_alter_tables_which_didnt_change. That change
enshrined the breakage into the test, instead of fixing the root cause,
which was that we added an an extra mutation to the schema (for
cdc options, which were disabled).
Different versions of boost have different rules for what conversions
from cpp_int to smaller intergers are allowed.
We already had a function that worked with all supported versions, but
it was not being use by lua.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200104041028.215153-1-espindola@scylladb.com>
I noticed this while looking at the crashes next is currently
experiencing.
While I have no idea if this fixes the issue, it does avoid broken
future warnings (for no_sharded_instance_exception) in a debug build.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200103201540.65324-1-espindola@scylladb.com>
* seastar 0525bbb08...36cf5c5ff (6):
> memcached: Fix use after free in shutdown
> Revert "task: stop wrapping tasks with unique_ptr"
> task: stop wrapping tasks with unique_ptr
> http: Change exception formating to the generic seastar one
> Merge "Avoid a few calls to ~exception_ptr" from Rafael
> tests: fix core generation with asan
This patch adds a very comprehensive test for the ConditionExpression
feature, i.e., the newer syntax of conditional writes replacing
the old-style "Expected" - for the UpdateItem, PutItem and DeleteItem
operations.
I wrote these tests while closely following the DynamoDB ConditionExpression
documentation, and attempted to cover all conceivable features, subfeatures
and subcases of the ConditionExpression syntax - to serve as a test for a
future support for this feature in Alternator (see issue #5053).
As usual, all these tests pass on AWS DynamoDB, but because we haven't yet
implemented this feature in Alternator, all but one xfail on Alternator.
Refs #5053.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191229143556.24002-1-nyh@scylladb.com>
If Alternator is requested to be enabled on a specific port but the port is
already taken, the boot fails as expected - but the error log is confusing;
It currently looks something like this:
WARN 2019-12-24 11:22:57,303 [shard 0] alternator-server - Failed to set up Alternator HTTP server on 0.0.0.0 port 8000, TLS port 8043: std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
... (many more messages about the server shutting down)
INFO 2019-12-24 11:22:58,008 [shard 0] init - Startup failed: std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
There are two problems here. First, the "WARN" should really be an "ERROR",
because it causes the server to be shut down and the user must see this error.
Second, the final line in the log, something the user is likely to see first,
contains only the ultimate cause for the exception (an address already in use)
but not the information what this address was needed for.
This patch solves both issues, and the log now looks like:
ERROR 2019-12-24 14:00:54,496 [shard 0] alternator-server - Failed to set up Alterna
tor HTTP server on 0.0.0.0 port 8000, TLS port 8043: std::system_error (error system
:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
...
INFO 2019-12-24 14:00:55,056 [shard 0] init - Startup failed: std::_Nested_exception<std::runtime_error> (Failed to set up Alternator HTTP server on 0.0.0.0 port 8000, TLS port 8043): std::system_error (error system:98, posix_listen failed for address 0.0.0.0:8000: Address already in use)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191224124127.7093-1-nyh@scylladb.com>
We don't support yet the ReturnValues option on PutItem, UpdateItem or
DeleteItem operations (see issue #5053), but if a user tries to use such
an option anyway, we silently ignore this option. It's better to fail,
reporting the unsupported option.
In this patch we check the ReturnValues option and if it is anything but
the supported default ("NONE"), we report an error.
Also added a test to confirm this fix. The test verifies that "NONE" is
allowed, and something which is unsupported (e.g., "DOG") is not ignored
but rather causes an error.
Refs #5053.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191216193310.20060-1-nyh@scylladb.com>
These are flags we always want to enable. In particular, we want them
to be used by the bots, but the bots run this script with
--configure-flags, so they were being discarded.
We put the user option later so that they can override the common
options.
Fixes#5505
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Takuya ASADA <syuu@scylladb.com>
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There is no requirement that all notes be placed in a single
PT_NOTE. It looks like recent lld's actually put each section in its
own PT_NOTE.
This change looks for build-id in all PT_NOTE headers.
Fixes#5525
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191227000311.421843-1-espindola@scylladb.com>
rpm compression uses xz, which is painfully slow. Adjust the
compression settings to run on all threads.
The xz utility documentation suggests that 0 threads is
equivalent to all CPUs, but apparently the library interface
(which rpmbuild uses) doesn't think the same way.
Message-Id: <20200101141544.1054176-1-avi@scylladb.com>
In the current code, support for case-sensitive (quoted) user-defined type
names is broken. For example, a test doing:
CREATE TYPE "PHone" (country_code int, number text)
CREATE TABLE cf (pk blob, pn "PHone", PRIMARY KEY (pk))
Fails - the first line creates the type with the case-sensitive name PHone,
but the second line wrongly ends up looking for the lowercased name phone,
and fails with an exception "Unknown type ks.phone".
The problem is in cql3_type_name_impl. This class is used to convert a
type object into its proper CQL syntax - for example frozen<list<int>>.
The problem is that for a user-defined type, we forgot to quote its name
if not lowercase, and the result is wrong CQL; For example, a list of
PHone will be written as list<PHone> - but this is wrong because the CQL
parser, when it sees this expression, lowercases the unquoted type name
PHone and it becomes just phone. It should be list<"PHone">, not list<PHone>.
The solution is for cql3_type_name_impl to use for a user-defined type
its get_name_as_cql_string() method instead of get_name_as_string().
get_name_as_cql_string() is a new method which prints the name of the
user type as it should be in a CQL expression, i.e., quoted if necessary.
The bug in the above test was apparently caused when our code serialized
the type name to disk as the string PHone (without any quoting), and then
later deserialized it using the CQL type parser, which converted it into
a lowercase phone. With this patch, the type's name is serialized as
"PHone", with the quotes, and deserialized properly as the type PHone.
While the extra quotes may seem excessive, they are necessary for the
correct CQL type expression - remember that the type expression may be
significantly more complex, e.g., frozen<list<"PHone">> and all of this,
including the quotes, is necessary for our parser to be able to translate
this string back into a type object.
This patch may cause breakage to existing databases which used case-
sensitive user-defined types, but I argue that these use cases were
already broken (as demonstrated by this test) so we won't break anything
that actually worked before.
Fixes#5544
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200101160805.15847-1-nyh@scylladb.com>
The class in question wants to run its own instances on different
shards, for this sake it keeps reference on sharded self to call
invoke_on() on. There's a handy peering_sharded_service<> in seastar
for the same, using it makes the code nicer and shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191226112401.23960-1-xemul@scylladb.com>
We had a lot of code in a .hh file, that while using templeates, was
only used from creating functions during startup.
This moves it to a new .cc file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200101002158.246736-1-espindola@scylladb.com>
I can't quite figure out how we were trying to write a sstable with
the large data handler already stopped, but the backtrace suggests a
good place to add extra checks.
This patch adds two check. One at the start and one at the end of
sstable::write_components. The first one should give us better
backtraces if the large_data_handler is already stopped. The second
one should help catch some race condition.
Refs: #5470
Message-Id: <20191231173237.19040-1-espindola@scylladb.com>
The standard printer for atomic_cell prints the value as hex,
because atomic_cell does not include the type. Add a type-aware
printer that allows the user to provide the type.
When the product name is other than "scylla", the debian
packaging scripts go over all files that starts with "scylla-"
and change the prefix to be the actual product name.
However, if there are no such files in the directory
the script will fail since the renaming command will
get the wildcard string instrad of an actual file name.
This patch replaces the command with a command with
an equivalent desired effect that only operates on files
if there are any.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20191230143250.18101-1-eliransin@scylladb.com>
Since we merged /usr/lib/scylla with /opt/scylladb, we removed
/usr/lib/scylla and replace it with the symlink point to /opt/scylladb.
However, RPM does not support replacing a directory with a symlink,
we are doing some dirty hack using RPM scriptlet, but it causes
multiple issues on upgrade/downgrade.
(See: https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/)
To minimize Scylla upgrading/downgrade issues on user side, it's better
to keep /usr/lib/scylla directory.
Instead of creating single symlink /usr/lib/scylla -> /opt/scylladb,
we can create symlinks for each setup scripts like
/usr/lib/scylla/<script> -> /opt/scylladb/scripts/<script>.
Fixes#5522Fixes#4585Fixes#4611
The '--builddir' option value is assigned to the "builddir" variable,
which is wrong. The correct variable is "BUILDDIR" so use that instead
to fix the '--builddir' option.
Also, add logging to the script when executing the "dist/redhat_build.rpm.sh"
script to simplify debugging.
Hit the following ubsan error with bootstrap_test:TestBootstrap.manual_bootstrap_test in debug mode:
service/storage_service.cc:3519:37: runtime error: load of value 190, which is not a valid value for type 'bool'
The use site is:
service::storage_service::is_cleanup_allowed(seastar::basic_sstring<char, unsigned int, 15u, true>)::{lambda(service::storage_service&)#1}::operator()(service::storage_service&) const at /local/home/bhalevy/dev/scylla/service/storage_service.cc:3519
While at it, initialize `_initialized` to false as well, just in case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Avoid following UBSAN error:
repair/row_level.cc:2141:7: runtime error: load of value 240, which is not a valid value for type 'bool'
Fixes#5531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
A Linux machine typically has multiple clocksources with distinct
performances. Setting a high-performant clocksource might result in
better performance for ScyllaDB, so this should be considered whenever
starting it up.
This patch introduces the possibility of enforcing optimized Linux
clocksource to Scylla's setup/start-up processes. It does so by adding
an interactive question about enforcing clocksource setting to scylla_setup,
which modifies the parameter "CLOCKSOURCE" in scylla_server configuration
file. This parameter is read by perftune.py which, if set to "yes", proceeds
to (non persistently) setting the clocksource. On x86, TSC clocksource is used.
Fixes#4474Fixes#5474Fixes#5480
Instances of `variable_specifications` are passed around as
shared_ptr's, which are redundant in this case since the class
is marked as `final`. Use `lw_shared_ptr` instead since we know
for sure it's not a polymorphic pointer.
Tests: unit(debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191225232853.45395-1-pa.solodovnikov@scylladb.com>
If any two directories of data/commitlog/hints/view_hints
are the same we still end up running verify_owner_and_mode
and disk_sanity(check_direct_io_support) in parallel
on the same directoriea and hit #5510.
This change uses std::set rather than std::vector to
collect a unique set of directories that need initialization.
Fixes#5510
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191225160645.2051184-1-bhalevy@scylladb.com>
db::commitlog::segment::batch_cycle() assumes that after a write
for a certain position completes (as reported by
_pending_ops.wait_for_pending()) it will also be flushed, but this is
true only if writing and flushing are atomic wrt _pending_ops lock.
It usually is unless flush_after is set to false when cycle() is
called. In this case only writing is done under the lock. This
is exactly what happens when a segment is closed. Flush is skipped
because zero header is added after the last entry and then flushed, but
this optimization breaks batch_cycle() assumption. Fix it by flushing
after the write atomically even if a segment is being closed.
Fixes#5496
Message-Id: <20191224115814.GA6398@scylladb.com>
The hints and view_hints directory has per-shard sub-dirs,
and the directories code tries to create, check and lock
all of them, including the base one.
The manipulations in question are excessive -- it's enough
to check and lock either the base dir, or all the per-shard
ones, but not everything. Let's take the latter approach for
its simplicity.
Fixes#5510
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looks-good-to: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223142429.28448-1-xemul@scylladb.com>
Pekka Enberg <penberg@scylladb.com> wrote:
> Image might not be present, but the subsequent "docker run" command will automatically pull it.
Just letting "docker run" fail produces kinda confusing error message,
referring to docker help, but the we want to provide the user
with our own help, so still fail early, just also try to pull the image
if "docker image inspect" failed, indicating it's not present locally.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223085219.1253342-4-bhalevy@scylladb.com>
Suggested-by: Pekka Enberg <penberg@scylladb.com>
> This will print all the available Docker images,
> many (most?) of them completely unrelated.
> Why not just print an error saying that no image was specified,
> and then perhaps print usage.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223085219.1253342-3-bhalevy@scylladb.com>
Add dbuild dependency on python3-colorama,
which will be used in test.py instead of a hand-made palette.
[avi: update tools/toolchain/image]
Message-Id: <20191223125251.92064-2-kostja@scylladb.com>
In this place we only need to know the number of endpoints,
while current code additionally shuffles them before counting.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two _identical_ methods in token_metadata class:
get_all_endpoints_count() and number_of_endpoints().
The former one is used (called) the latter one is not used, so
let's remove it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This greatly helps to narrow down the source of schema digest mismatch
between nodes. Intented use is to enable this logger on disagreeing
nodes and trigger schema digest recalculation and observe which
mutations differ in digest and then examine their content.
Message-Id: <1574872791-27634-1-git-send-email-tgrabiec@scylladb.com>
In commit b463d7039c (repair: Introduce
get_combined_row_hash_response), working_row_buf_nr is returned in
REPAIR_GET_COMBINED_ROW_HASH in addition to the combined hash. It is
scheduled to be part of 3.1 release. However it is not backported to 3.1
by accident.
In order to be compatible between 3.1 and 3.2 repair. We need to drop
the working_row_buf_nr in 3.2 release.
Fixes: #5490
Backports: 3.2
Tests: Run repair in a mixed 3.1 and 3.2 cluster
Changes summary:
* make `cql3::result_set` movable-only
* change signature of `cql3::result::result_set` to return by cref
* adjust available call sites to the aforementioned method to accept cref
Motivation behind this change is elimination of dangerous API,
which can easily set a trap for developers who don't expect that
result_set would be returned by value.
There is no point in copying the `result_set` around, so make
`cql3::result::result_set` to cache `result_set` internally in a
`unique_ptr` member variable and return a const reference so to
minimize unnecessary copies here and there.
Tests: unit(debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191220115100.21528-1-pa.solodovnikov@scylladb.com>
So a higher level component using the validator to validate a stream can
catch only validation errors, and let any other incidental exception
through.
This allows building data correctors on top of the
`mutation_fragment_stream_validator`, by filtering a fragment stream
through a validator, catching invalid fragment stream exceptions and
dropping the respective fragments from the stream.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191220073443.530750-1-bdenes@scylladb.com>
This reverts commit 237ba74743. While it
works for the scylla executable, it fails for iotune, which is built
by seastar. It should be reinstated after we pass the correct link
parameters to the seastar build system.
"
These series solves an issue with scylla_setup and prevent it from
waiting forever if housekeeping cannot look for the new Scylla version.
Fixes#5302
It should be backported to versions that support offline installations.
"
* 'scylla_setup_timeout' of git://github.com/amnonh/scylla:
scylla_setup: do not wait forever if no reply is return housekeeping
scylla_util.py: Add optional timeout to out function
Having a long path allows patchelf to change the interpreter without
changing the PT_LOAD headers and therefore without moving the
build-id out of the first page.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191213224803.316783-1-espindola@scylladb.com>
Suppose we have a multi-dc setup (e.g. 9 nodes distributed across
3 datacenters: [dc1, dc2, dc3] -> [3, 3, 3]).
When a query that uses LWT is executed with LOCAL_SERIAL consistency
level, the `storage_proxy::get_paxos_participants` function
incorrectly calculates the number of required participants to serve
the query.
In the example above it's calculated to be 5 (i.e. the number of
nodes needed for a regular QUORUM) instead of 2 (for LOCAL_SERIAL,
which is equivalent to LOCAL_QUORUM cl in this case).
This behavior results in an exception being thrown when executing
the following query with LOCAL_SERIAL cl:
INSERT INTO users (userid, firstname, lastname, age) VALUES (0, 'first0', 'last0', 30) IF NOT EXISTS
Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl LOCAL_SERIAL. Requires 5, alive 3" info={'required_replicas': 5, 'alive_replicas': 3, 'consistency': 'LOCAL_SERIAL'}
Tests: unit(dev), dtest(consistency_test.py)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191216151732.64230-1-pa.solodovnikov@scylladb.com>
The actual buffer is now in a member called 'data'. Leave the old
`dummy.dummy` and `dummy` as fall-back. This seems to change every
Fedora release.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191218153544.511421-1-bdenes@scylladb.com>
Schema is node-global, update_schema_version_and_announce() updates
all shards. We don't need to recalculate it from every shard, so
install the listeners only on shard 0. Reduces noise in the logs.
Message-Id: <1574872860-27899-1-git-send-email-tgrabiec@scylladb.com>
The option in question apparently does not work, several sharded objects
are start()-ed (and thus instanciated) in join_roken_ring, while instances
themselves of these objects are used during init of other stuff.
This leads to broken seastar local_is_initialized assertion on sys_dist_ks,
but reading the code shows more examples, e.g. the auth_service is started
on join, but is used for thrift and cql servers initialization.
The suggestion is to remove the option instead of fixing. The is_joined
logic is kept since on-start joining still can take some time and it's safer
to report real status from the API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191203140717.14521-1-xemul@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5366 from Calle Wilund:
Moves schema creation/alter/drop awareness to use new "before" callbacks from
migration manager, and adds/modifies log and streams table as part of the base
table modification.
Makes schema changes semi-atomic per node. While this does not deal with updates
coming in before a schema change has propagated cluster, it now falls into the
same pit as when this happens without CDC.
Added side effect is also that now schemas are transparent across all subsystems,
not just cql.
Patches:
cdc_test: Add small test for altering base schema (add column)
cdc: Handle schema changes via migration manager callbacks
migration_manager: Invoke "before" callbacks for table operations
migration_listener: Add empty base class and "before" callbacks for tables
cql_test_env: Include cdc service in cql tests
cdc: Add sharded service that does nothing.
cdc: Move "options" to separate header to avoid to much header inclusion
cdc: Remove some code from header
* seastar 00da4c8760...0525bbb08f (7):
> future: Simplify future_state_base::any move constructor
> future: don't create temporary tuple on future::get().
> future: don't instantiate new future on future::then_wrapped().
> future: clean-up the Result handling in then_wrapped().
> Merge "Fix core dumps when asan is enabled" from Rafael
> future: Move ignore to the base class
> future: Don't delete in ignore
Currently `SCYLLA-VERSION-GEN` is not a dependency of any target and
hence changes done to it will not be picked up by ninja. To trigger a
rebuild and hence version changes to appear in the `scylla` target
binary, one has to do `touch configure.py`. This is counter intuitive
and frustrating to people who don't know about it and wonder why their
changed version is not appearing as the output of `scylla --version`.
This patch makes `SCYLLA-VERSION-GEN` a dependency of `build.ninja,
making the `build.ninja` target out-of-date whenever
`SCYLLA-VERSION-GEN` is changed and hence will trigger a rerun of
`configure.py` when the next target is built, allowing a build of e.g.
`scylla` to pick up any changes done to the version automatically.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191217123955.404172-1-bdenes@scylladb.com>
"
This patch set rearranges the test files so that
it is now possible to search for tests automatically,
and adds this functionality to test.py
"
* 'test.py.requeue' of ssh://github.com/scylladb/scylla-dev:
cmake: update CMakeLists.txt to scan test/ rather than tests/
test.py: automatically lookup all unit and boost tests
tests: move all test source files to their new locations
tests: move a few remaining headers
tests: move another set of headers to the new test layout
tests: move .hh files and resources to new locations
tests: remove executable property from data_listeners_test.cc
When scylla is installed without a network connectivity, the test if a
newer version is available can cause scylla_setup to wait forever.
This patch adds a limit to the time scylla_setup will wait for a reply.
When there is no reply, the relevent error will be shown that it was
unable to check for newer version, but this will not block the setup
script.
Fixes#5302
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5343 from
Benny Halevy.
Fixes#5340
Hold the sstable_deletion_sem table::move_sstables_from_subdirs to
serialize access to the staging directory. It now synchronizes snapshot,
compaction deletion of sstables, and view_update_generator moving of
sstables from staging.
Tests:
unit (dev) [expect test_user_function_timestamp_return that fails for me locally, but also on master]
snapshot_test.py (dev)
I used the following as a reference:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/virtual/ClientsTable.java
At this moment there is only info about IP, clients outgoing port,
client 'type' (i.e. CQL/thrift/alternator), shard ID and username.
Column `request_count' is NOT present and CK consists of
(`port', `client_type'), contrary to what C*'s has: (`port').
Code that notifies `system.clients` about new connections goes
to top-level files `connection_notifier.*`. Currently only CQL
clients are observed, but enum `client_type` can be used in future
to notify about connections with other protocols.
Hold the _sstable_deletion_sem while moving sstables from the staging directory
so not to move them under the feet of table::snapshot.
Fixes#5340
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Consumer may throw, in this case, break from the loop and retry.
move_sstable_from_staging_in_thread may theoretically throw too,
ignore the error in this case since the sstable was already processed,
individual move failures are already ignored and moving from staging
will be retried upon restart.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for "batch" move of several sstables from staging
to the base directory, allowing the caller to sync the directories
once when all are moved rather than for each one of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
distributed_loader::probe_file needlessly creates a seastar
thread for it and the next patch will use it as part of
a parallel_for_each loop to move a list of sstables
(and sync the directories once at the end).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We do not yet support the parallel Scan options (TotalSegments, Segment),
as reported in issue #5059. But even before implementing this feature, it
is important that we produce an error if a user attempts to use it - instead
of outright ignoring this parameter. This is what this patch does.
The patch also adds a full test, test_scan.py::test_scan_parallel, for the
parallel scan feature. The test passes on DynamoDB, and still xfails
on Alternator after this patch - but now the Scan request fails immediately
reporting the unsupported option - instead of what the pre-patch code did:
returning the wrong results and the test failing just when the results
do not match the expectations.
Refs #5059.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191217084917.26191-1-nyh@scylladb.com>
"
Only the first patch is needed to fix the undefined behavior, but the
followup ones simplify the memory management around user types.
"
* 'espindola/fix-5193-v2' of ssh://github.com/espindola/scylla:
db: Don't use lw_shared_ptr for user_types_metadata
user_types_metadata: don't implement enable_lw_shared_from_this
cql3: pass a const user_types_metadata& to prepare_internal
db: drop special case for top level UDTs
db: simplify db::cql_type_parser::parse
db: Don't create a reference to nullptr
Add test for loading a schema with a non native type
1. Move tests to test (using singular seems to be a convention
in the rest of the code base)
2. Move boost tests to test/boost, other
(non-boost) unit tests to test/unit, tests which are
expected to be run manually to test/manual.
Update configure.py and test.py with new paths to tests.
Move sstable_test.hh, test_table.hh and cql_assertions.hh from tests/ to
test/lib or test/boost and update dependent .cc files.
Move tests/perf_sstable.hh to test/perf/perf_sstable.hh
Move another small subset of headers to test/
with the same goals:
- preserve bisectability
- make the revision history traceable after a move
Update dependent files.
The plan is to move the unstructured content of tests/ directory
into the following directories of test/:
test/lib - shared header and source files for unit tests
test/boost - boost unit tests
test/unit - non-boost unit tests
test/manual - tests intended to be run manually
test/resource - binary test resources and configuration files
In order to not break git bisect and preserve the file history,
first move most of the header files and resources.
Update paths to these files in .cc files, which are not moved.
We're seeing the following error from test from time to time:
fatal error: in "test_allocation_failure": std::runtime_error: Did not get expected exception from writing too large record
This is not reproducible and the error string does not contain
enough information to figure out what happened exactly, therefore
this patch adds an exception if the call succeeded unexpectedly
and also prints the unexpected exception if one was caught.
Refs #4714
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191215052434.129641-1-bhalevy@scylladb.com>
The is_podman check was depending on `docker -v` printing "podman" in
the output, but that doesn't actually work, since podman prints $0.
Use `docker --help` instead, which will output "podman".
Also return podman's return status, which was previously being
dropped.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
On start there are two things that scylla does on data/commitlog/etc.
dirs: locks and verifies permissions. Right now these two actions are
managed by different approaches, it's convenient to merge them.
Also the introduced in this set directories class makes a ground for
better --workdir option handling. In particular, right now the db::config
entries are modified after options parse to update directories with
the workdir prefix. With the directories class at hands will be able
to stop doing this.
"
* 'br-directories-cleanup' of https://github.com/xemul/scylla:
directories: Make internals work on fs::path
directories: Cleanup adding dirs to the vector to work on
directories: Drop seastar::async usage
directories: Do touch_and_lock and verify sequentially
directories: Do touch_and_lock in parallel
directories: Move the whole stuff into own .cc file
directories: Move all the dirs code into .init method
file_lock: Work with fs::path, not sstring
This reverts commit 4333b37f9e. It breaks upgrades,
and the user question is not informative enough for the user to make a correct
decision.
Fixes#5478.
Fixes#5480.
The unordered_set is turned into vector since for fs::path
there's no hash() method that's needed for set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the only future-able operation remained is the call to
parallel_for_each(), all the rest is non-blocking preparation,
so we can drop the seastar::async and just return the future
from parallel_for_each.
The indendation is now good, as in previous patch is was prepared
just for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to drop the seastar::async() usage.
Currently we have two places that return futures -- calls to
parallel_for_each-s. We can either chain them together or,
since both are working on the same set of directories, chain
actions inside them.
For code simplicity I propose to chain actions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The list of paths that should be touch-and-locked is already
at hands, this shortens the code and makes it slightly faster
(in theory).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order not to pollute the root dir place the code in
utils/ directory, "utils" namespace.
While doing this -- move the touch_and_lock from the
class declaration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The seastar::async usage is tempoarary, added for bisect-safety,
soon it will go away. For this reason the indentation in the
.init method is not "canonical", but is prepared for one-patch
drop of the seastar::async.
The hinted_handoff_enabled arg is there, as it's not just a
parameter on config, it had been parsed in main.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main.cc code that converts sstring to fs::path
will be patched soon, the file_desc::open belongs
to seastar and works on sstrings.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by the transport
code that jumps to a correct shard and re-process incoming message there.
"
* 'gleb/bounce_lwt_request' of github.com:scylladb/seastar-dev:
lwt: take raw lock for entire cas duration
lwt: drop invoke_on in paxos_state prepare and accept
lwt: Process lwt request on a owning shard
storage_service: move start_native_transport into a thread
transport: change make_result to takes a reference to cql result instead of shared_ptr
The implementation of Expected's BEGINS_WITH operator on blobs was
incorrect, naively comparing the base64-encoded strings, which doesn't
work. This patches fixes the code to compare the decoded strings.
The reason why the BEGINS_WITH test missed this bug was that we forgot
to check the blob case and only tested the string case; So this patch
also adds the missing test - which reproduces this bug, and verifies
its fix.
Fixes#5457
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191211115526.29862-1-nyh@scylladb.com>
The LIKE operator requires filtering, so needs_filtering() must check
is_LIKE(). This already happens for partition columns, but it was
overlooked for clustering columns in the initial implementation of
LIKE.
Fixes#5400.
Tests: unit(dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
To be used by dtest as an indicator that endpoint's hints
were drained and hints directory is removed.
Refs #5354
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The user_types_metadata can simply be owned by the keyspace. This
simplifies the code since we never have to worry about nulls and the
ownership is now explicit.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It looks like this was done just to avoid including
user_types_metadata.hh, which seems a bit much considering that it
requires adding specialization to the seastar namespace.
A followup patch will also stop using lw_shared_ptr for
user_types_metadata.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We never modify the user_types_metadata via prepare_internal, so we
can pass it a const reference.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This was originally done in 7f64a6ec4b,
but that commit was reverted in reverted in
8517eecc28.
The revert was done because the original change would call parse_raw
for non UDT types. Unlike the old patch, this one doesn't change the
behavior of non UDT types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The variant of db::cql_type_parser::parse that has a
user_types_metadata argument was only used from the variant that
didn't. This inlines one in the other.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The user_types variable can be null during db startup since we have to
create types before reading the system table defining user types.
This avoids undefined behavior, but is unlikely that it was causing
more serious problems since the variable is only used when creating
user types and we don't create any until after all system tables are
read, in which case the user_types variable is not null.
Fixes#5193
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by transport code
that jumps to a correct shard and re-process incoming message there.
This patch adds comprehensive tests for the ReturnValue parameter of
the write operations (PutItem, UpdateItem, DeleteItem), which can return
pre-write or post-write values of the modified item. The tests are in
a new test file, alternator-test/test_returnvalues.py.
This feature is not yet implemented in Alternator, so all the new
tests xfail on Alternator (and all pass on AWS).
Refs #5053
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191127163735.19499-1-nyh@scylladb.com>
This patch adds tests for Query's "ScanIndexForward" parameter, which
can be used to return items in reversed sort order.
We test that a Limit works and returns the given number of *last* items
in the sort order, and also that such reverse queries can be resumed,
i.e., paging works in the reverse order.
These tests pass against AWS DynamoDB, but fail against Alternator (which
doesn't support ScanIndexForward yet), so it is marked xfail.
Refs #5153.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191127114657.14953-1-nyh@scylladb.com>
The JMX interface is implemented by the scylla-jmx project, not scylla.
Therefore, let's remove this historical reference to MBeans from
storage_proxy.
Message-Id: <20191211121652.22461-1-penberg@scylladb.com>
"
Add --experimental-features -- a vector of features to unlock. Make corresponding changes in the YAML parser.
Fixes#5338
"
* 'vecexper' of https://github.com/dekimir/scylla:
config: Add `experimental_features` option
utils: Add enum_option
* seastar e440e831c8...00da4c8760 (7):
> Merge "reactor: fix iocb pool underflow due to unaccounted aio fsync" from Avi
Fixes#5443.
> install-dependencies.sh: fix arch dependencies
> Merge " rpc: fix use-after-free during rpc teardown vs. rpc server message handling" from Benny
> Merge "testing: improve the observability of abandoned failed futures" from Botond
> rework the fair_queue tester
> directory_test: Update to use run instead of run_deprecated
> log: support fmt 6.0 branch with chrono.h for log
In the calculate_delay() code for view-backlog flow control, we calculate
a delay and cap it at a "budget" - the remaining timeout. This timeout is
measured in milliseconds, but the capping calculation converted it into
microseconds, which overflowed if the timeout is very large. This causes
some tests which enable the UB sanitizer to fail.
We fix this problem by comparing the delay to the budget in millisecond
resolution, not in microsecond resolution. Then, if the calculated delay
is short enough, we return it using its full microsecond resolution.
Fixes#5412
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191205131130.16793-1-nyh@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5453
from Piotr Sarna:
Checking the EQ relation for alternator attributes is usually performed
simply by comparing underlying JSON objects, but sets (SS, BS, NS types)
need a special routine, as we need to make sure that sets stored in
a different order underneath are still equal, e.g:
[1, 3, 2] == [1, 2, 3]
Fixes#5021
Checking the EQ relation for alternator attributes is usually performed
simply by comparing underlying JSON objects, but sets (SS, BS, NS types)
need a special routine, as we need to make sure that sets stored in
a different order underneath are still equal, e.g:
[1, 3, 2] == [1, 2, 3]
Fixes#5021
If a set of mutations contains both an entry that deletes a table
and an entry that adds a table with the same name, it's expected
to be a replacement operation (delete old + create new),
rather than a useless "try to create a table even though it exists
already and then immediately delete the original one" operation.
As such, notifications about the deletions should be performed
before notifications about the creations. The place that originally
suffered from this wrong order is view building - which in this case
created an incorrect duplicated entry in the view building bookkeeping,
and then immediately deleted it, resulting in having old, deprecated
entries with stale UUIDS lying in the build queue and never proceeding,
because the underlying table is long gone.
The issue is fixed by ensuring the order of notifications:
- drops are announced first, view drops are announced before table drops;
- creations follow, table creations are announced before views;
- finally, changes to tables and views are announced;
Fixes#4382
Tests: unit(dev), mv_populating_from_existing_data_during_node_stop_test
Iterate over an array holding all rpm names to see if any
of them is missing from `dist/ami/files`. If they are missing,
look them up in build/redhat/RPMS/x86_64 so that if reloc/build_rpm.sh
was run manually before dist/ami/build_ami.sh we can just collect
the built rpms from its output dir.
If we're still missing any rpms, then run reloc/build_rpm.sh
and copy the required rpms from build/redhat/RPMS/x86_64.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
In swagger 1.2 int is defined as int32.
We originally used int following the jmx definition, in practice
internally we use uint and int64 in many places.
While the API format the type correctly, an external system that uses
swagger-based code generator can face a type issue problem.
This patch replace all use of int in a return type with long that is defined as int64.
Changing the return type, have no impact on the system, but it does help
external systems that use code generator from swagger.
Fixes#5347
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Provide some explanation on prio strings + direction to gnutls manual.
Document client auth option.
Remove confusing/misleading statement on "custom options"
Message-Id: <20191210123714.12278-1-calle@scylladb.com>
Enable existing NOT_CONTAINS test, add NOT_CONTAINS to the list of
recognized operators, implement check_NOT_CONTAINS, and hook it up to
verify_expected_one().
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the user wants to turn on only some experimental features, they
can use this new option. The existing `experimental` option is
preserved for backwards compatibility.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Almost all commands provided by `scylla-gdb.py` are safe to use. The
worst that could happen if they fail is that you won't get the desired
information. There is one notable exception: `scylla thread`. If
anything goes wrong while this command is executed - gdb crashes, a bug
in the command, etc. - there is a good change the process under
examination will crash. Sometimes this is fine, but other times e.g.
when live debugging a production node, this is unacceptable.
To avoid any accidents add documentation to all commands working with
`seastar::thread`. And since most people don't read documentation,
especially when debugging under pressure, add a safety net to the
`scylla thread` command. When run, this command will now warn of the
dangers and will ask for explicit acknowledgment of the risk of crash,
by means of passing an `--iamsure` flag. When this flag is missing, it
will refuse to run. I am sure this will be very annoying but I am also
sure that the avoided crashes are worth it.
As part of making `scylla thread` safe, its argument parsing code is
migrated to `argparse`. This changes the usage but this should be fine
because it is well documented.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191129092838.390878-1-bdenes@scylladb.com>
The ami description attribute is only allowed to be 255
characters long. When build_ami.sh generates an ami, it
generates an ami description which is a concatenation
of all of the componnents version strings. It can
happen that the description string is too long which
eventually causes the ami build to fail. This patch
trims the description string to 255 characters.
It is ok since the individual versions of the components
are also saved in tags attached to the image.
Tests:
1. Reproduced with a long description and
validated that it doesn't fail after the fix.
Fixes#5435
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20191209141143.28893-1-eliransin@scylladb.com>
A Linux machine typically has multiple clocksources with distinct
performances. Setting a high-performant clocksource might result in
better performance for ScyllaDB, so this should be considered whenever
starting it up.
This patch introduces the possibility of enforcing optimized Linux
clocksource to Scylla's setup/start-up processes. It does so by adding
an interactive question about enforcing clocksource setting to scylla_setup,
which modifies the parameter "CLOCKSOURCE" in scylla_server configuration
file. This parameter is read by perftune.py which, if set to "yes", proceeds
to (non persistently) setting the clocksource. On x86, TSC clocksource is
used.
Fixes#4474
This allows us to create/alter/drop log and desc tables "atomically"
with the base, by including these mutations in the original mutation
set, i.e. batch create/alter tables.
Note that population does not happen until types are actually
already put into database (duh), thus there _is_ still a gap
between creating cdc and it being truly usable. This may or may
not need handling later.
A general build system knows about 3 machines:
* build: where the building is running
* host: where the built software will run
* target: the machine the software will produce code for
The target machine is only relevant for compilers, so we can ignore
it.
Until now we could ignore the build and host distinction too. This
patch adds the first difference: don't use host ld_flags when linking
build tools (gen_crc_combine_table).
The reason for this change is to make it possible to build with
-Wl,--dynamic-linker pointing to a path that will exist on the host
machine, but may not exist on the build machine.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191207030408.987508-1-espindola@scylladb.com>
Empty base type makes for less boiler plate in implementations.
The "before" callbacks are for listeners who need to potentially
react/augment type creation/alteration _before_ actually
committing type to schema tables (and holding the semaphore for this).
I.e. it is for cdc to add/modify log/desc tables "atomically" with base.
The vm.swappiness sysctl controls the kernel's prefernce for swapping
anonymous memory vs page cache. Since Scylla uses very large amounts
of anonymous memory, and tiny amounts of page cache, the correct setting
is to prefer swapping page cache. If the kernel swaps anonymous memory
the reactor will stall until the page fault is satisfied. On the other
hand, page cache pages usually belong to other applications, usually
backup processes that read Scylla files.
This setting has been used in production in Scylla Cloud for a while
with good results.
Users can opt out by not installing the scylla-kernel-conf package
(same as with the other kernel tunables).
* seastar 166061da3...e440e831c (8):
> Fail tests on ubsan errors
> future: make a couple of asserts more strict
> future: Move make_ready out of line
> config: Do not allow zero rates
Fixes#5360
> future: add new state to avoid temporaries in get_available_state().
> future: avoid temporary future_state on get_available_state().
> future: inline future::abandoned
> noncopyable_function: Avoid uninitialized warning on empty types
Previously, scylla used min/max(blob)->blob overload for collections,
tuples and UDTs; effectively making the results being printed as blobs.
This PR adds "dynamically"-typed min()/max() functions for compound types.
These types can be complicated, like map<int,set<tuple<..., and created
in runtime, so functions for them are created on-demand,
similarly to tojson(). The comparison remains unchanged - underneath
this is still byte-by-byte weak lex ordering.
Fixes#5139
* jul-stas/5139-minmax-bad-printing-collections:
cql_query_tests: Added tests for min/max/count on collections
cql3: min()/max() for collections/tuples/UDTs do not cast to blobs
This tests new min/max function for collections and tuples. CFs
in test suite were named according to types being tested, e.g.
`cf_map<int,text>' what is not a valid CF name. Therefore, these
names required "escaping" of invalid characters, here: simply
replacing with '_'.
Before:
cqlsh> insert into ks.list_types (id, val) values (1, [3,4,5]);
cqlsh> select max(val) from ks.list_types;
system.max(val)
------------------------------------------------------------
0x00000003000000040000000300000004000000040000000400000005
After:
cqlsh> select max(val) from ks.list_types;
system.max(val)
--------------------
[3, 4, 5]
This is accomplished similarly to `tojson()`/`fromjson()`: functions
are generated on demand from within `cql3::functions::get()`.
Because collections can have a variety of types, including UDTs
and tuples, it would be impossible to statically define max(T t)->T
for every T. Until now, max(blob)->blob overload was used.
Because `impl_max/min_function_for` is templated with the
input/output type, which can be defined in runtime, we need type-erased
("dynamic") versions of these functors. They work identically, i.e.
they compare byte representations of lhs and rhs with
`bytes::operator<`.
Resolves#5139
If you merge a pull request that contains multiple patches via
the github interface, it will document itself as the committer.
Work around this brain damage by using the command line.
Introduce a new verb dedicated for receiving and sending hints: HINT_MUTATION. It is handled on the streaming connection, which is separate from the one used for handling mutations sent by coordinator during a write.
The intent of using a separate connection is to increase fairness while handling hints and user requests - this way, a situation can be avoided in which one type of requests saturate the connection, negatively impacting the other one.
Information about new RPC support is propagated through new gossip feature HINTED_HANDOFF_SEPARATE_CONNECTION.
Fixes#4974.
Tests: unit(release)
To make gms::inet_address::to_string() similar in output to origin.
The sole purpose being quick and easy fix of API/JMX ipv6
formatting of endpoints etc, where strings are used as lexical
comparisons instead of textual representation.
A better, but more work, solution is to fix the scylla-jmx
bridge to do explicit parse + re-format of addresses, but there
are many such callpoints.
An even better solution would be to fix nodetool to not make this
mistake of doing lexical comparisons, but then we risk breaking
merge compatibility. But could be an option for a separate
nodeprobe impl.
Message-Id: <20191204135319.1142-1-calle@scylladb.com>
Currently query_options objects is passed to a trace stopping function
which makes it mandatory to make them alive until the end of the
query. The reason for that is to add prepared statement parameters to
the trace. All other query options that we want to put in the trace are
copied into trace_state::params_values, so lets copy prepared statement
parameters there too. Trace enabled case will become a little bit more
expensive but on the other hand we can drop a continuation that holds
query_options object alive from a fast path. It is safe to drop the call
to stop_foreground_prepared() here since The tracing will be stopped
in process_request_one().
Message-Id: <20191205102026.GJ9084@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5381 by
Peng Jian, fixing multiple small issues with Redis:
* Rename the options related to Redis API, and describe them clearly.
* Rename redis_transport_port to redis_port
* Rename redis_transport_port_ssl to redis_ssl_port
* Rename redis_default_database_count to redis_database_count
* Remove unnecessary option enable_redis_protocol
* Modify the default value of opition redis_read_consistency_level and redis_write_consistency_level to LOCAL_QUORUM
* Fix the DEL command: support to delete mutilple keys in one command.
* Fix the GET command: return the empty string when the required key is not exists.
* Fix the redis-test/test_del_non_existent_key: mark xfail.
On aarch64, asan detected a use-after-move. It doesn't happen on x86_64,
likely due to different argument evaluation order.
Fix by evaluating full_slice before moving the schema.
Note: I used "auto&&" and "std::move()" even though full_slice()
returns a reference. I think this is safer in case full_slice()
changes, and works just as well with a reference.
Fixes#5419.
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
CREATE INDEX index1 ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
skip-read-validation -node 127.0.0.1;
Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.
Refs #5409Fixes#4615Fixes#5418
This test execution time dominates by a serious margin
test execution time in dev/release mode: reducing its
execution time improves the test.py turnaround by over 70%.
Message-Id: <20191204135315.86374-2-kostja@scylladb.com>
Currently, 'scylla thread' uses arch_prctl() to extract the value of
fsbase, used to reference thread local variables. gdb 8 added support
for directly accessing the value as $fs_base, so use that instead. This
works from core dumps as well as live processes, as you don't need to
execute inferior functions.
The patch is required for debugging threads in core dumps, but not
sufficient, as we still need to set $rip and $rsp, and gdb still[1]
doesn't allow this.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=9370
The feature introduced by this commit declares that hints can be sent
using the new dedicated RPC verb. Before using the new verb, nodes need
to know if other nodes in the cluster will be able to handle the new
RPC verb.
Introduce a new verb dedicated for receiving and sending hints:
HINT_MUTATION. It is handled on the streaming connection, which is
separate from the one used for handling mutations sent by coordinator
during a write.
The intent of using a separate connection is to increase fariness while
handling hints and user requests - this way, a situation can be avoided
in which one type of requests saturate the connection, negatively
impacting the other one.
"
In several cases in distributed testing (dtest) we trigger compaction using nodetool compact assuming that when it is done, it is indeed really done.
However, the way compaction is currently implemented in scylla, it may leave behind some background tasks to delete the old sstables that were compacted.
This commit changes major compaction (triggered via the ss::force_keyspace_compaction api) so it would wait on the background deletes and will return only when they finish.
Fixes#4909
Tests: unit(dev), nodetool_refresh_with_data_perms_test, test_nodetool_snapshot_during_major_compaction
"
We may able to use chrony setup script on future version of RHEL/CentOS,
it better to run chrony setup when RHEL version >= 8, not only 8.
Note that on Fedora it still provides ntp/ntpdate package, so we run
ntp setup on it for now. (same on debian variants)
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191203192812.5861-1-syuu@scylladb.com>
`segment_manager' now uses a decorated version of `timed_out_error'
with hardcoded name. On the other hand `region_group' uses named
`on_request_expiry' within its `expiring_fifo'.
Fixes#5211
In 79935df959 replay apply-call was
changed from one with no continuation to one with. But the frozen
mutation arg was still just lambda local.
Change to use do_with for this case as well.
Message-Id: <20191203162606.1664-1-calle@scylladb.com>
Exception messages contain semaphore's name (provided in ctor).
This affects the queue overflow exception as well as timeout
exception. Also, custom throwing function in ctor was changed
to `prethrow_action', i.e. metrics can still be updated there but
now callers have no control over the type of the exception being
thrown. This affected `restricted_reader_max_queue_length' test.
`reader_concurrency_semaphore'-s docs are updated accordingly.
In a build configured with --debuginfo 0 the scylla binary still ends
up with some debug info from the libraries that are statically linked
in.
We should avoid compiling subprojects (including seastar) with debug
info when none is needed, but this at least avoids it showing up in
the binary.
The main motivation for this is that it is confusing to get a binary
with *some* debug info in it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191127215843.44992-1-espindola@scylladb.com>
This series refactors the collection de/serialization code to use
fragmented buffers, avoiding the large allocations and the associated
pains when working with large collections. Currently all operations that
involve collections require deserializing them, executing the operation,
then serializing them again to their internal storage format. The
de/serialization operations happen in linearized buffers, which means
that we have to allocate a buffer large enough to hold the *entire*
collection. This can cause immense pressure on the memory allocator,
which, in the face of memory fragmentation, might be unable to serve the
allocation at all. We've seen this causing all sorts of nasty problems,
including but not limited to: failing compactions, failing memtable
flush, OOM crash and etc.
Users are strongly discouraged from using large collections, yet they
are still a fact of life and have been haunting us since forever.
The proper solution for these problems would be to come up with an
in-memory format for collections, however that is a major effort, with a
lot of unknowns. This is something we plan on doing at some point but
until it happens we should make life less painful for those with large
collections.
The goal of this series is to avoid the need of allocating these large
buffers. Serialization now happens into a `bytes_ostream` which
automatically fragments the values internally. Deserialization happens
with `utils::linearizing_input_stream` (introduced by this series), which
linearizes only the individual collection cells, but not the entire
collection.
An important goal of this series was to introduce the least amount of
risk, and hence the least amount of code. This series does not try to
make a revolution and completely revamp and optimize the
de/serialization codepaths. These codepaths have their days numbered so
investing a lot of effort into them is in vain. We can apply incremental
optimizations where we deem it necessary.
Fixes: #5341
Support to delete multiple keys in one DEL command.
The feature of returning number of the really deleted keys is still not supported.
Return empty string to client for GET command when the required key is not exists.
Fixes: #5334
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Rename option redis_transport_port to redis_port, which the redis transport listens on for clients.
Rename option redis_transport_port_ssl to redis_ssl_port, which the redis TLS transport listens on for clients.
Rename option redis_database_count. Set the redis dabase count.
Rename option redis_keyspace_opitons to redis_keyspace_replication_strategy_options. Set the replication strategy for redis keyspace.
Remove option enable_redis_protocol, which is unnecessary.
Fixes: #5335
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Commit 96009881d8 added diffutils to the dependencies via
Seastar's install-dependencies.sh, after it was inadvertantly
dropped in 1164ff5329 (update to Fedora 31; diffutils is no
longer brought in as a side effect of something else).
Regenerate the image to include diffutils.
Ref #5401.
This patch added subtests for EOF process, it reads and writes the socket
directly by using protocol cmds.
We can add more tests in future, tests with Redis module will hide some
protocol error.
Signed-off-by: Amos Kong <amos@scylladb.com>
podman needs to relabel directories in exactly the same cases docker
does. The difference is that podman cannot relabel /tmp.
The reason it was working before is that in practice anyone using
dbuild has already relabeled any directories that need relabeling,
with the exception of /tmp, since it is recreated on every boot.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191201235614.10511-2-espindola@scylladb.com>
Use `utils::linearizing_input_stream` for the deserizalization of the
collection. Allows for avoiding the linearization of the entire cell
value, instead only linearizing individual values as they are
deserialized from the buffer.
`linearizing_input_stream` allows transparently reading linearized
values from a fragmented buffer. This is done by linearizing on-the-fly
only those read values that happen to be split across multiple
fragments. This reduces the size of the largest allocation from the size
of the entire buffer (when the entire buffer is linearized) to the size
of the largest read value. This is a huge gain when the buffer contains
loads of small objects, and modest gains when the buffer contains few
large objects. But the even in the worst case the size of the largest
allocation will be less or equal compared to the case where the entire
buffer is linearized.
This stream is planned to be used as glue code between the fragmented
cell value and the collection deserialization code which expects to be
reading linearized values.
Currently the loop which writes the data from the fragmented origin to
the destination, moves to the next chunk eagerly after writing the value
of the current chunk, if the current chunk is exhausted.
This presents a problem when we are writing the last piece of data from
the last chunk, as the chunk will be exhausted and we eagerly attempt to
move to the next chunk, which doesn't exist and dereferencing it will
fail. The solution is to not be eager about moving to the next chunk and
only attempt it if we actually have more data to write and hence expect
more chunks.
The presence of `const_iterator` seems to be a requirement as well
although it is not part of the concept. But perhaps it is just an
assumption made by code using it.
Not just bytes::output_iterator. Allow writing into streams other than
just `bytes`. In fact we should be very careful with writing into
`bytes` as they require potentially large contiguous allocations.
The `write()` method is now templatized also on the type of its first
argument, which now accepts any CharOutputIterator. Due to our poor
usage of namespace this now collides with `write` defined inside
`db/commitlog/commitlog.cc`. Luckily, the latter doesn't really have to
be templatized on the data type it reads from, and de-templatizing it
resolves the clash.
Currently interactive RAID setup prompt does not list virtio-blk devices due to
following reasons:
- We fail matching '-p' option on 'lsblk --help' output since misusage of
regex functon, list_block_devices() always skipping to use lsblk output.
- We don't check existance of /dev/vd* when we skipping to use lsblk.
- We mistakenly excluded virtio-blk devices on 'lsblk -pnr' output using '-e'
option, but we actually needed them.
To fix the problem we need to use re.search() instead of re.match() to match
'-p' option on 'lsblk --help', need to add '/dev/vd*' on block device list,
then need to stop '-e 252' option on lsblk which excludes virtio-blk.
Additionally, it better to parse 'TYPE' field of lsblk output, we should skip
'loop' devices and 'rom' devices since these are not disk devices.
Fixes#4066
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191201160143.219456-1-syuu@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5392 from
Dejan Mircevski.
Refs #5034
The patches:
alternator: Implement LE operator in Expected
alternator: Implement GE operator in Expected
alternator: Make cmp diagnostic a value, not funct
utils: Add operator<< for big_decimal
alternator: Implement BETWEEN operator in Expected
All check_compare diagnostics are static strings, so there's no need
to call functions to get them. Instead of a function, make diagnostic
a simple value.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"Fix two problem in scylla_io_setup:
- Problem 1: paths of default directories is invalid, introduced by
commit 5ec1915 ("scylla_io_setup: assume default directories under
/var/lib/scylla").
- Problem 2: wrong path join, introduced by commit 31ddb21
("dist/common/scripts: support nonroot mode on setup scripts").
Fix a problem in scylla_io_setup, scylla_fstrim and scylla_blocktune.py:
- Fixed default scylla directories when they aren't assigned in
scylla.yaml"
Fixes#5370
Reviewed-by: Pavel Emelyanov <xemul@scylladb.com>
* 'scylla_io_setup' of git://github.com/amoskong/scylla:
use parse_scylla_dirs_with_default to get scylla directories
scylla_io_setup: fix data_file_directories check
scylla_util: introduce helper to process the default scylla directories
scylla_util: get workdir by datadir() if it's not assigned in scylla.yaml
scylla_io_setup: fix path join of default scylla directories
Use asyncio as a more modern way to work with concurrency,
Process signals in an event loop, terminate all outstanding
tests before exiting.
Breaking change: this commit requires Python 3.7 or
newer to run this script. The patch adds a version
check and a message to enforce it.
Similar to trace_state keep shared_ptr<tracing> _local_tracing_ptr
in one_session_records when constructed so it can be used
during shutdown.
Fixes#5243
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
_user cannot outlive client_state class instance, so there is no point
in holding it in shared_ptr.
Tested: debug test.py and dtest auth_test.py
Message-Id: <20191128131217.26294-5-gleb@scylladb.com>
Only do_stop_rpc_server uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.
Message-Id: <20191128131217.26294-3-gleb@scylladb.com>
Only do_stop_native_transport() uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.
Message-Id: <20191128131217.26294-2-gleb@scylladb.com>
* seastar 6f0ef32514...5c25de907a (7):
> shared_future: Fix crash when all returned futures time out
Fixes#5322.
> future: don't create temporaries on get_value().
> reactor: lower the default stall threshold to 200ms
> reactor: Simplify network initialization
> reactor: Replace most std::function with noncopyable_function
> futures: Avoid extra moves in SEASTAR_TYPE_ERASE_MORE mode
> inet_address: Make inet_address == operator ignore scope (again)
Merged pull request https://github.com/scylladb/scylla/pull/5311 from
Juliusz Stasiewicz:
This is a partial solution to #5139 (only for two types) because of the
above and because collections are much harder to do. They are coming in
a separate PR.
References #5139. Aggregate functions, like max(), when invoked
on `inet_address' and `time_native_type' used to choose
max(blob)->blob overload, with casting of argument and result to
bytes. This is because appropriate calls to
`aggregate_fcts::make_XXX_function()' were missing. This commit
adds them. Functioning remains the same but now clients see
user-friendly representations of aggregate result, not binary.
Comparing inet addresses without inet::operator< is performed by
trick, where ADL is bypassed by wrapping the name of std::min/max
and providing an overload of wrapper on inet type.
Currently we support to assign workdir from scylla.yaml, and we use many
hardcode '/var/lib/scylla' in setup scripts.
Some setup scripts get scylla directories by parsing scylla.yaml, introduced
parse_scylla_dirs_with_default() that adds default values if scylla directories
aren't assigned in scylla.yaml
Signed-off-by: Amos Kong <amos@scylladb.com>
Currently we are checking an invalid path of some default scylla directories,
the directories don't exist, so the tune will always be skipped. It caused by
two problem.
Problem 1: paths of default directories is invalid
Introduced by commit 5ec191536e, we try to tune some scylla default directories
if they exist. But the directory paths we try are wrong.
For example:
- What we check: /var/lib/scylla/commitlog_directory
- Correct one: /var/lib/scylla/commitlog
Problem 2: wrong path join
Introduced by commit 31ddb2145a, default_path might be replaced from
'/var/lib/scylla/' to '/var/lib/scylla'.
Our code tries to check an invalid path that is wrongly join, eg:
'/var/lib/scyllacommitlog'
Signed-off-by: Amos Kong <amos@scylladb.com>
The default values of data_file_directories and commitlog_directory were
commented by commit e0f40ed16a. It causes scylla_util.py:get_scylla_dirs() to
fail in checking the values.
This patch changed get_scylla_dirs() to return default data/commitlog
directories if they aren't set.
Fixes#5358
Reviewed-by: Pavel Emelyanov <xemul@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Add a test, test_query.py::test_query_limit, to verify that the Limit
parameter correctly limits the number of rows returned by the Query.
This was supposed to already work correctly - but we never had a test for
it. As we hoped, the test passes (on both Alternator and DynamoDB).
Another test, test_query.py::test_query_limit_paging, verifies that
paging can be done with any setting of Limit. We already had tests
for paging of the Scan operation, but not for the Query operation.
Refs #5153
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is a comprehensive test for the "Select" parameter of Query and Scan
operations, but only for the base-table case, not index, so another future
patch should add similar tests in test_gsi.py and test_lsi.py as well.
The main use of the Select parameter is to allow returning just the count
of items, instead of their content, but it also has other esoteric options,
all of which we test here.
The test currently succeeds on AWS DynamoDB, demonstrating that the test
is correct, but fails on Alternator because the "Select" parameter is not
yet supported. So the test is marked xfail.
Refs #5058
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently the command tries to read all seastar smp queues in its
initialization code in the constructor. This constructor is run each
time `scylla-gdb.py` is sourced in `gdb` which leads to slowdowns and
sometimes also annoying errors because the sourcing happens in the wrong
context and seastar symbols are not available.
Avoid this by running this initializing code lazily, on the first
invocation.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191127095408.112101-1-bdenes@scylladb.com>
This patchset adds missing "const" function qualifiers throughout
the Scylla code base, which would make code less error-prone.
The changeset incorporates Kostja's work regarding const qualifiers
in the cql code hierarchy along with a follow-up patch addressing the
review comment of the corresponding patch set (the patch subject is
"cql: propagate const property through prepared statement tree.").
The boost 1.67 release notes says
Changed maximum supported year from 10000 to 9999 to resolve various issues
So change the test to use a larger number so that we get an exception
with both boost 1.66 and boost 1.67.
Fixes#5344
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191126180327.93545-1-espindola@scylladb.com>
Docker on Fedora 31 is flakey, and is not supported at all on RHEL 8.
Podman is a drop-in replacement for docker; this series adds support
for using podman in dbuild.
Apart from actually working on Fedora 31 hosts,
podman is nicer in being more secure and not requiring a daemon.
Fixes#5332
As suggested in issue #4586 here is the helper that prints
"shutting down foo" message, then shuts the foo down, then
prints the "[it] was successull" one. In between it catches
the exception (if any) and warns this in logs.
By "then" I mean literally then, not the seastar's then() :)
Fixes: #4586
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:
auto sem = semaphore(0);
will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:
auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});
will present a message with at least some debugging context:
"Semaphore broken: io_concurrency_sem"
It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.
At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.
Refs #4999
Tests: unit(dev), manual: hardcoding a failure in view building code
Currently scylla_io_setup will skip in scylla_setup, because we didn't support
those new instance types.
I manually executed scylla_io_setup, and the scylla-server started and worked
well.
Let's apply this patch first, then check if there is some new problem in
ami-test.
Signed-off-by: Amos Kong <amos@scylladb.com>
cql_statement is a class representing a prepared statement in Scylla.
It is used concurrently during execution, so it is important that its
change is not changed by execution.
Add const qualifier to the execution methods family, throghout the
cql hierarchy.
Mark a few places which do mutate prepared statement state during
execution as mutable. While these are not affecting production today,
as code ages, they may become a source of latent bugs and should be
moved out of the prepared state or evaluated at prepare eventually:
cf_property_defs::_compaction_strategy_class
list_permissions_statement::_resource
permission_altering_statement::_resource
property_definitions::_properties
select_statement::_opts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
---
v2:
- Have stop easrlier so that exception in start/listen do
not prevent prometheu.stop from calling
As suggested in issue #4586 here is the helper that prints
"shutting down foo" message, then shuts the foo down, then
prints the "shutting down foo was successfull". In between
it catches the exception (if any) and warns this in logs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5218
from Piotr Jastrzębski:
Users should be able to decide whether they need preimage or not. There is
already an option for that but it's not respected by the implementation.
This PR adds support for this functionality.
Tests: unit(dev).
Individual patches:
cdc: Don't take storage_proxy as transformer::pre_image_select param
cdc::append_log_mutations: use do_with instead of shared_ptr
cdc::append_log_mutations: fix undefined behavior
cdc: enable preimage in test_pre_image_logging test
cdc: Return preimage only when it's requested
cdc: test both enabled and disabled preimage in test_pre_image_logging
Before stopping the db itself, stop the migration service.
It must be stopped before RPC, but RPC is not stopped yet
itself, so we should be safe here.
Here's the tail of the resulting logs:
INFO 2019-11-20 11:22:35,193 [shard 0] init - shutdown migration manager
INFO 2019-11-20 11:22:35,193 [shard 0] migration_manager - stopping migration service
INFO 2019-11-20 11:22:35,193 [shard 1] migration_manager - stopping migration service
INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database started
INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database finished
INFO 2019-11-20 11:22:35,193 [shard 0] init - stopping prometheus API server
INFO 2019-11-20 11:22:35,193 [shard 0] init - Scylla version 666.development-0.20191120.25820980f shutdown complete.
Also -- stop the mm on drain before the commitlog it stopped.
[Tomasz: mm needs the cl because pulling schema changes from other nodes
involves applying them into the database. So cl/db needs to be
stopped after mm is stopped.]
The drain logs would look like
...
INFO 2019-11-25 11:00:40,562 [shard 0] migration_manager - stopping migration service
INFO 2019-11-25 11:00:40,562 [shard 1] migration_manager - stopping migration service
INFO 2019-11-25 11:00:40,563 [shard 0] storage_service - DRAINED:
and then on stop
...
INFO 2019-11-25 11:00:46,427 [shard 0] init - shutdown migration manager
INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database started
INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database finished
INFO 2019-11-25 11:00:46,427 [shard 0] init - stopping prometheus API server
INFO 2019-11-25 11:00:46,427 [shard 0] init - Scylla version 666.development-0.20191125.3eab6cd54 shutdown complete.
Fixes#5300
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191125080605.7661-1-xemul@scylladb.com>
In get_full_row_hashes_with_rpc_stream and
repair_get_row_diff_with_rpc_stream_process_op which were introduced in
the "Repair switch to rpc stream" series, rx_hashes_nr metrics are not
updated correctly.
In the test we have 3 nodes and run repair on node3, we makes sure the
following metrics are correct.
assertEqual(node1_metrics['scylla_repair_tx_hashes_nr'] + node2_metrics['scylla_repair_tx_hashes_nr'],
node3_metrics['scylla_repair_rx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_rx_hashes_nr'] + node2_metrics['scylla_repair_rx_hashes_nr'],
node3_metrics['scylla_repair_tx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_nr'] + node2_metrics['scylla_repair_tx_row_nr'],
node3_metrics['scylla_repair_rx_row_nr'])
assertEqual(node1_metrics['scylla_repair_rx_row_nr'] + node2_metrics['scylla_repair_rx_row_nr'],
node3_metrics['scylla_repair_tx_row_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_bytes'] + node2_metrics['scylla_repair_tx_row_bytes'],
node3_metrics['scylla_repair_rx_row_bytes'])
assertEqual(node1_metrics['scylla_repair_rx_row_bytes'] + node2_metrics['scylla_repair_rx_row_bytes'],
node3_metrics['scylla_repair_tx_row_bytes'])
Tests: repair_additional_test.py:RepairAdditionalTest.repair_almost_synced_3nodes_test
Fixes: #5339
Backports: 3.2
The code was iterating over a collection that was modified
at the same time. Iterators were used for that and collection
modification can invalidate all iterators.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5310 from
Avi Kivity:
This is a minor update as gcc and boost versions did not change. A noteable
update is patchelf 0.10, which adds support to large binaries.
A few minor issues exposed by the update are fixed in preparatory patches.
Patches:
dist: rpm: correct systemd post-uninstall scriptlet
build: force xz compression on rpm binary payload
tools: toolchain: update to Fedora 31
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).
Fix by destroying the coroutine first.
Fixes#5327.
Tests:
- row_cache_test (dev)
Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
By default rpm uses dwz to merge the debug info from various
binaries. Unfortunately, it looks like addr2line has not been updated
to handle this:
// This works
$ addr2line -e build/release/scylla 0x1234567
$ dwz -m build/release/common.debug build/release/scylla.debug build/release/iotune.debug
// now this fails
$ addr2line -e build/release/scylla 0x1234567
I think the issue is
https://sourceware.org/bugzilla/show_bug.cgi?id=23652Fixes#5289
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123015734.89331-1-espindola@scylladb.com>
By default we were compressing debug info only in release
executables. The idea, if I understand it correctly, is that those are
the ones we ship, so we want a more compact binary.
I don't think that was doing anything useful. The compression is just
gzip, so when we ship a .tar.xz, having the debug info compressed
inside the scylla binary probably reduces the overall compression a
bit.
When building a rpm the situation in amusing. As part of the rpm
build process the debug info is decompressed and extracted to an
external file.
Given that most of the link time goes to compressing debug info, it is
probably a good idea to just skip that.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123022825.102837-1-espindola@scylladb.com>
Structure the code to be able to introduce futures.
Apply trivial cleanups.
Switch to asyncio and use it to work with processes and
handle signals. Cleanup all processes upon signal.
This patch implements a simple optimization for LWT: it makes PAXOS
prepare phase query locally and return the current value of the modified
key so that a separate query is not necessary. For more details see
patch 6. Patch 1 fixes a bug in next. Patches 2-5 contain trivial
preparatory refactoring.
Current LWT implementation uses at least three network round trips:
- first, execute PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to participating replicas
(there's also learn phase, but we don't wait for it to complete).
The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.
To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.
Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature in the early development
stage and marked experimental.
To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:
latency 99% (ms) bandwidth (rq/s) timeouts (rq/s)
clients before after before after before after
1 2 2 626 637 0 0
5 4 3 2616 2843 0 0
10 3 3 4493 4767 0 0
50 7 7 10567 10833 0 0
100 15 15 12265 12934 0 0
200 48 30 13593 14317 0 0
400 185 60 14796 15549 0 0
600 290 94 14416 15669 0 0
800 568 118 14077 15820 2 0
1000 710 118 13088 15830 9 0
2000 1388 232 13342 15658 85 0
3000 1110 363 13282 15422 233 0
4000 1735 454 13387 15385 329 0
That is, this optimization improves max LWT bandwidth by about 15%
and allows to run 3-4x more clients while maintaining the same level
of system responsiveness.
invoke_on() guarantees that captures object won't be destroyed until the
future returned by the invoked function is resolved so there's no need
to move key, token, proposal for calling paxos_state::*_impl helpers.
The test_health_only_works_for_root_path test checks that while Alternator's
HTTP server responds to a "GET /" request with success ("health check"), it
should respond to different URLs with failures (page not found).
One of the URLs it tested was "/..", but unfortunately some versions of
Python's HTTP client canonize this request to just a "/", causing the
request to unexpectedly succeed - and the test to fail.
So this patch just drops the "/.." check. A few other nonsense URLs are
attempted by the test - e.g., "/abc".
Fixes#5321
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
One of the fields still missing in DescribeTable's response (Refs #5026)
was the table's schema - KeySchema and AttributeDefinitions.
This patch adds this missing feature, and enables the previously-xfailing
test test_describe_table_schema.
A complication of this patch is that in a table with secondary indexes,
we need to return not just the base table's schema, but also the indexes'
schema. The existing tests did not cover that feature, so we add here
two more tests in test_gsi.py for that.
One of these secondary-index schema tests, test_gsi_2_describe_table_schema,
still fails, because it outputs a range-key which Scylla added to a view
because of its own implementation needs, but wasn't in the user's
definition of the GSI. I opened a separate issue #5320 for that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Serialize reference_wrapper<T> as T and make sure is_equivalent<> treats
reference_wrapper<T> wrapped in std::optional<> or std::variant<>, or
std::tuple<> as T.
We need it to avoid copying query::result while serializing
paxos::promise.
Currently even if `-a` or `-s 0` is provided, `scylla task_histogram`
will scan a limited amount of pages due to a bug in the scan loop's stop
condition, which will be trigger a stop once the default sample limit is
reached. Fix the loop by skipping this check when the user wants to scan
all tasks.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191121141706.29476-1-bdenes@scylladb.com>
At least some versions of 'podman logs --follow' hang when the
container eventually exits (also happens with docker on recent
versions). Fortunately, we don't need to use 'podman logs --follow'
and can use the more natural non-detached 'podman run', because
podman does not proxy SIGTERM and instead shuts down the container
when it receives it.
So, to work around the problem, use the same code path in interactive
and non-interactive runs, when podman is in use instead of docker.
With docker, we went to considerable lengths to ensure that
access to mounted volume was done using the calling user, including
supplementary groups. This avoids root-owned files being left around
after a build, and ensures that access to group-shared files (like
/var/cache/ccache) works as expected.
All of this is unnecessary and broken when using podman. Podman
uses a proxy to access files on behalf of the container, so naturally
all access is done using the calling user's identity. Since it remaps
user and group IDs, assigning the host uid/gid is meaningless. Using
--userns host also breaks, because sudo no longer works.
Fix this by making all the uid/gid/selinux games specific to docker and
ignore them when using podman. To preserve the functionality of tools
that depend on $HOME, set that according to the host setting.
The original fix (10f6b125c8) didn't
take into account that if there was a failed memtable flush (Refs
flush) but is not a flushable memtable because it's not the latest in
the memtable list. If that happens, it means no other memtable is
flushable as well, cause otherwise it would be picked due to
evictable_occupancy(). Therefore the right action is to not flush
anything in this case.
Suspected to be observed in #4982. I didn't manage to reproduce after
triggering a failed memtable flush.
Fixes#3717
podman refuses to start with duplicate volumes, which routinely
happen if the toplevel directory is the working directory. Detect
this and avoid the duplicate.
UnitTest class uses juggles with the name 'args' quite a bit to
construct the command line for a unit test, so let's spread
the harness command line arguments from the unit test command line
arguments a bit apart by consistently calling the harness command line
arguments 'options', and unit test command line arguments 'args'.
Rename usage() to parse_cmd_line().
Create unique UnitTest objects in find_tests() for each found match,
including repeat, to ensure each test has its own unique id.
This will also be used to store execution state in the test.
It somewhat stands in the way of using asyncio
This patch also implements a more comprehensive
fix for #5303, since we not only have --repeat, but
run some tests in different configurations, in which
case xml output is also overwritten.
When starting scylla daemon as non-root the initialization fails
because standard /var/lib/scylla is not accessible by regular users.
Making the default dir accessible for user is not very convenient
either, as it will cause conflicts if two or more instances of scylla
are in use.
This problem can be resolved by specifying --commitlog-directory,
--data-file-directories, etc on start, but it's too much typing. I
propose to revive Nadav's --home option that allows to move all the
directories under the same prefix in one go.
Unlike Nadav's approach the --workdir option doesn't do any tricky
manipulations with existing directories. Insead, as Pekka suggested,
the individual directories are placed under the workir if and only
if the respective option is NOT provided. Otherwise the directory
configuration is taken as is regardless of whether its absolute or
relative path.
The values substutution is done early on start. Avi suggested that
this is unsafe wrt HUP config re-read and proper paths must be
resolved on the fly, but this patch doesn't address that yet, here's
why.
First of all, the respective options are MustRestart now and the
substitution is done before HUP handler is installed.
Next, commitlog and data_file values are copied on start, so marking
the options as LiveUpdate won't make any effect.
Finally, the existing named_value::operator() returns a reference,
so returning a calculated (and thus temporary) value is not possible
(from my current understanding, correct me if I'm wrong). Thus if we
want the *_directory() to return calculated value all callers of them
must be patched to call something different (e.g. *_directory.get() ?)
which will lead to more confusion and errors.
Changes v3:
- the option is --workdir back again
- the existing *directory are only affected if unset
- default config doesn't have any of these set
- added the short -W alias
Changes v2:
- the option is --home now
- all other paths are changed to be relative
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191119130059.18066-1-xemul@scylladb.com>
I found these mismatched types while converting some member functions
to standalone functions, since they have to use the public API that
has more type checks.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191120181213.111758-4-espindola@scylladb.com>
Use pkg-config to search for Lua dependencies rather
than hard-code include and link paths.
Avoid using boost internals, not present in earlier
versions of boost.
Reviewed-by: Rafael Avila de Espindola <espindola@scylladb.com>
Message-Id: <20191120170005.49649-1-kostja@scylladb.com>
Use `-Wl,--threads` flag to enable multi-threaded linking when
using `ld.gold` linker.
Additional compilation test is required because it depends on whether
or not the `gold` linker has been compiled with `--enable-threads` option.
This patch introduces a substantial improvement to the link times of
`scylla` binary in release and debug modes (around 30 percent).
Local setup reports the following numbers with release build for
linking only build/release/scylla:
Single-threaded mode:
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09.30
Multi-threaded mode:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:51.57
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191120163922.21462-1-pa.solodovnikov@scylladb.com>
Merged patch series from Peng Jian, adding optionally-enabled Redis API
support to Scylla. This feature is experimental, and partial - the extent
of this support is detailed in docs/redis/redis.md.
Patches:
Document: add docs/redis/redis.md
redis: Redis API in Scylla
Redis API: graft redis module to Scylla
redis-test: add test cases for Redis API
This is a minor update as gcc and boost versions do not change.
glibc-langpack-en no longer gets pulled in by default. As it is required
by some locale use somewhere, it is added to the explicit dependencies.
Fedora 31 switched the default compression to zstd, which isn't readable
by some older rpm distributions (CentOS 7 in particular). Tell it to use
the older xz compression instead, so packages produced on Fedora 31 can
be installed on older distributions.
The post-uninstall scriptlet requires a parameter, but older versions
of rpm survived without it. Fedora 31's rpm is more strict, so supply
this parameter.
In this document, the detailed design and implementation of Redis API in
Scylla is provided.
v2: build: work around ragel 7 generated code bug (suggested by Avi)
Ragel 7 incorrectly emits some unused variables that don't compile.
As a workaround, sed them away.
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Scylla has advantage and amazing features. If Redis build on the top of Scylla,
it has the above features automatically. It's achived great progress
in cluster master managment, data persistence, failover and replication.
The benefits to the users are easy to use and develop in their production
environment, and taking avantages of Scylla.
Using the Ragel to parse the Redis request, server abtains the command name
and the parameters from the request, invokes the Scylla's internal API to
read and write the data, then replies to client.
Signed-off-by: Peng Jian, <pengjian.uestc@gmail.com>
Merged patch set by Piotr Dulikowski:
This change corrects condition on which a row was considered expired by its
TTL.
The logic that decides when a row becomes expired was inconsistent with the
logic that decides if a single cell is expired. A single cell becomes expired
when expiry_timestamp <= now, while a row became expired when
expiry_timestamp < now (notice the strict inequality). For rows inserted
with TTL, this caused non-key cells to expire (change their values to null)
one second before the row disappeared. Now, row expiry logic uses non-strict
inequality.
Fixes#4263,
Fixes#5290.
Tests:
unit(dev)
python test described in issue #5290
It is useful to have an option to limit the execution time of a shell
script.
This patch adds an optional timeout parameter, if a parameter will be
provided a command will return and failure if the duration is passed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Merged patch series from Juliusz Stasiewicz:
Welcome to my first PR to Scylla!
The task was intended as a warm-up ("noob") exercise; its description is
here: #4182 Sorry, I also couldn't help it and did some scouting: edited
descriptions of some metrics and shortened few annoyingly long LoC.
Those are typically symptoms of use-after-free or memory corruption in
the program. It's better to catch such error sooner than later.
That situation is also dangerous since if a valid descriptor would
land under the invalid access, not the one which was intended for the
operation, then the operation may be performed on the wrong file and
result in corruption.
Message-Id: <1565206788-31254-1-git-send-email-tgrabiec@scylladb.com>
This change corrects condition on which a row was considered expired by
its TTL.
The logic that decides when a row becomes expired was inconsistent with
the logic that decides if a single cell is expired. A single cell
becomes expired when `expiry_timestamp <= now`, while a row became
expired when `expiry_timestamp < now` (notice the strict inequality).
For rows inserted with TTL, this caused non-key cells to expire (change
their values to null) one second before the row disappeared. Now, row
expiry logic uses non-strict inequality.
Fixes: #4263, #5290.
Tests:
- unit(dev)
- python test described in issue #5290
Currently, we overwrite the same XML output file for each test repeat
cycle. This can cause invalid XML to be generated if the XML contents
don't match exactly for every iteration.
Fix the problem by appending the test repeat cycle in the XML filename
as follows:
$ ./test.py --repeat 3 --name vint_serialization_test --mode dev --jenkins jenkins_test
$ ls -1 *.xml
jenkins_test.release.vint_serialization_test.0.boost.xml
jenkins_test.release.vint_serialization_test.1.boost.xml
jenkins_test.release.vint_serialization_test.2.boost.xml
Fixes#5303.
Message-Id: <20191119092048.16419-1-penberg@scylladb.com>
In a cross-dc large cluster, the receiver node of the gossip SYN message
might be slow to send the gossip ACK message. The ack messages can be
large if the payload of the application state is big, e.g.,
CACHE_HITRATES with a lot of tables. As a result, the unlimited ACK
message can consume unlimited amount of memory which causes OOM
eventually.
To fix, this patch queues the SYN message and handles it later if the
previous ACK message is still being sent. However, we only store the
latest SYN message. Since the latest SYN message from peer has the
latest information, so it is safe to drop the previous SYN message and
keep the latest one only. After this patch, there can be at most 1
pending SYN message and 1 pending ACK message per peer node.
"
This patch series adds only UDF support, UDA will be in the next patch series.
With this all CQL types are mapped to Lua. Right now we setup a new
lua state and copy the values for each argument and return. This will
be optimized once profiled.
We require --experimental to enable UDF in case there is some change
to the table format.
"
* 'espindola/udf-only-v4' of https://github.com/espindola/scylla: (65 commits)
Lua: Document the conversions between Lua and CQL
Lua: Implement decimal subtraction
Lua: Implement decimal addition
Lua: Implement support for returning decimal
Lua: Implement decimal to string conversion
Lua: Implement decimal to floating point conversion
Lua: Implement support for decimal arguments
Lua: Implement support for returning varint
Lua: Implement support for returning duration
Lua: Implement support for duration arguments
Lua: Implement support for returning inet
Lua: Implement support for inet arguments
Lua: Implement support for returning time
Lua: Implement support for time arguments
Lua: Implement support for returning timeuuid
Lua: Implement support for returning uuid
Lua: Implement support for uuid and timeuuid arguments
Lua: Implement support for returning date
Lua: Implement support for date arguments
Lua: Implement support for returning timestamp
...
Add mode_list rule to ninja build and use it by default when searching
for tests in test.py.
Now it is no longer necessary to explicitly specify the test mode when
invoking test.py.
(cherry picked from commit a211ff30c7f2de12166d8f6f10d259207b462d4b)
The goal of this patch is to fix issue #5280, a rather serious Alternator
bug, where Scylla fails to restart when an Alternator table has secondary
indexes (LSI or GSI).
Traditionally, Cassandra allows table names to contain only alphanumeric
characters and underscores. However, most of our internal implementation
doesn't actually have this restriction. So Alternator uses the characters
':' and '!' in the table names to mark global and local secondary indexes,
respectively. And this actually works. Or almost...
This patch fixes a problem of listing, during boot, the sstables stored
for tables with such non-traditional names. The sstable listing code
needlessly assumes that the *directory* name, i.e., the CF names, matches
the "\w+" regular expression. When an sstable is found in a directory not
matching such regular expression, the boot fails. But there is no real
reason to require such a strict regular expression. So this patch relaxes
this requirement, and allows Scylla to boot with Alternator's GSI and LSI
tables and their names which include the ":" and "!" characters, and in
fact any other name allowed as a directory name.
Fixes#5280.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191114153811.17386-1-nyh@scylladb.com>
This document adds information about how fixes are tracked to be
backported into releases and what is the procedure that is followed to
backport those fixes.
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Allow filtering the resolved addresses by a startswith string.
The common use case if for resolving vtable ptrs, when resolving
the output of `find_vptrs` that may be too long for the host
(running gdb) memory size. In this case the number of vtable
ptrs is considerably smaller than the total number of objects
returned by find_ptrs (e.g. 462 vs. 69625 in a OOM core I
examined from scylla --smp=2 --memory=1024M)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
CQL tracing would only report file I/O involving one sstable, even if
multiple sstables were read from during the query.
Steps to reproduce:
create a table with NullCompactionStrategy
insert row, flush memtables
insert row, flush memtables
restart Scylla
tracing on
select * from table
The trace would only report DMA reads from one of the two sstables.
Kudos to @denesb for catching this.
Related issue: #4908
There are ... signs of massive start/stop code rework in the
main() function. While fixing the sub-modules interdependencies
during start/stop I've polished these signs too, so here's the
simplest ones.
Serialize provided partition_key in such a way that the serialized value
will hash to the same token as the original key. This way when system.paxos
table is updated the update is shard local.
Message-Id: <20191114135449.GU10922@scylladb.com>
"
When using INSERT JSON with frozen collection/UDT columns, if the columns were left unspecified or set to null, the statement would create an empty non-null value for these columns instead of using null values as it should have. For example:
cqlsh:b> create table t (k text primary key, l frozen<list<int>>, m frozen<map<int, int>>, s frozen<set<int>>, u frozen<ut>);
cqlsh:b> insert into t JSON '{"k": "insert_json"}';
cqlsh:b> select * from t;
k | l | m | s | u
-------------------+------+------+------+------
insert_json | [] | {} | {} |
This PR fixes this.
Resolves#5246 and closes#5270.
"
* 'frozen-json' of https://github.com/kbr-/scylla:
tests: add null/unset frozen collection/UDT INSERT JSON test
cql3: correctly handle frozen null/unset collection/UDT columns in INSERT JSON
cql3: decouple execute from term binding in user_type::setter
* seastar 75e189c6ba...6f0ef32514 (6):
> Merge "Add named semaphores" from Piotr
> parallel_for_each_state: pass rvalue reference to add_future
> future: Pass rvalue to uninitialized_wrapper::uninitialized_set.
> dependencies: Add libfmt-dev to debian
> log: Fix logger behavior when logging both to stdout and syslog.
> README.md: list Scylla among the projects using Seastar
If CONSISTENCY is set to SERIAL or LOCAL SERIAL, all write requests must
fail according to Cassandra's documentation. However, batched writes
bypass this check. Fix this.
This patch resurrects Cassandra's code validating a consistency level
for CAS requests. Basically, it makes CAS requests use a special
function instead of validate_for_write to make error messages more
coherent.
Note, we don't need to resurrect requireNetworkTopologyStrategy as
EACH_QUORUM should work just fine for both CAS and non-CAS writes.
Looks like it is just an artefact of a rebase in the Cassandra
repository.
The dependencies are provided by the frozen toolchain. If a dependency
is missing, we must update the toolchain rather than rely on build-time
installation, which is not reproducible (as different package versions
are available at different times).
Luckily "dnf install" does not update an already-installed package. Had
that been a case, none of our builds would have been reproducible, since
packages would be updated to the latest version as of the build time rather
than the version selected by the frozen toolchain.
So, to prevent missing packages in the frozen toolchain translating to
an unreproducible build, remove the support for installing dependencies
from reloc/build_reloc.sh. We still parse the --nodeps option in case some
script uses it.
Fixes#5222.
Tests: reloc/build_reloc.sh.
MV backpressure code frees mutation for delayed client replies earlier
to save memory. The commit 2d7c026d6e that
introduced the logic claimed to do it only when all replies are received,
but this is not the case. Fix the code to free only when all replies
are received for real.
Fixes#5242
Message-Id: <20191113142117.GA14484@scylladb.com>
Resharding is responsible for the scheduling the deletion of sstables
resharded, but it was not refreshing the cache of the shards those
sstables belong to, which means cache was incorrectly holding reference
to them even after they were deleted. The consequence is sstables
deleted by resharding not having their disk space freed until cache
is refreshed by a subsequent procedure that triggers it.
Fixes#5261.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191107193550.7860-1-raphaelsc@scylladb.com>
* 'cql-trivial-cleanup' of ssh://github.com/scylladb/scylla-dev:
cql: rename modification_statement::_sets_a_collection to _selects_a_collection
cql: rename _column_conditions to _regular_conditions
cql: remove unnecessary optional around prefetch_data
"
Use a fixed-size, rather than a dynamically growing
bitset for column mask. This avoids unnecessary memory
reallocation in the most common case.
"
* 'column_set' of ssh://github.com/scylladb/scylla-dev:
schema: pre-allocate the bitset of column_set
schema: introduce schema::all_columns_count()
schema: rename column_mask to column_set
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that ovewrote
rows previously present in a memtable.
Tests: unit(release)
Merged patch series from Dejan Mircevski. Implements the "LT" and "GT"
operators of the Expected update option (i.e., conditional updates),
and enables the pre-existing tests for them.
Since it contains a precise set of columns, it's more
accurate to call it a set, not a mask. Besides, the name
column_mask is already used for column options on storage
level.
This is merely to avoid confusion: we use _sets prefix to indicate that
there are operations over static/regular columns (_sets_static_columns,
_sets_regular_columns), but _sets_a_collection is set for both operations
and conditions. So let's rename it to _selects_a_collection and add some
comments.
It's weird that modification_statement has _static_conditions for
conditions on static columns and _column_conditions for conditions on
regular columns, as if conditions on static columns are not column
conditions. Let's rename _column_conditions to _regular_conditions to
avoid confusion.
Before this commit, an empty non-null value was created for
frozen collection/UDT columns when an INSERT JSON statement was executed
with the value left unspecified or set to null.
This was incompatible with Cassandra which inserted a null (dead cell).
Fixes#5270.
--pkg option on install.sh is introduced for .deb packaging since it requires
different install directory for each subpackage.
But we actually able to use "debian/tmp" for shared install directory,
then we can specify file owner of the package using .install files.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191030203142.31743-1-syuu@scylladb.com>
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that ovewrote
rows previously present in a memtable.
Tests: unit(release)
This is just the minimum to pass a value to Lua. Right now you can't
actually do anything with it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This adds support for all integer types. Followup commits will
implement the missing types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This makes it substantially simpler to support both varint and
decimal, which will be implemented in a followup patch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this we support all simple integer types. Followup patches will
implement the missing types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This add a wrapper around the lua interpreter so that function
executions are interruptible and return futures.
With this patch it is possible to write and use simple UDFs that take
and return integer values.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This adds a requires_thread predicate to functions and propagates that
up until we get to code that already returns futures.
We can then use the predicate to decide if we need to use
seastar::async.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This refactors test_schema_digest_does_not_change to also test a
schema with user defined functions and user defined aggregates.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this it is possible to create user defined functions and
aggregates and they are saved to disk and the schema change is
propagated.
It is just not possible to call them yet.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The parser now rejects having both OR REPLACE and IF NOT EXISTS in the
same statement.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This updates UDF syntax to the current specification.
In particular, this removes DETERMINISTIC and adds "CALLED ON NULL
INPUT" and "RETURNS NULL ON NULL INPUT".
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
At some point we should make the function list non static, but this
allows us to write tests for now.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This avoids allocating a std::vector and is more flexible since the
iterator can be passed to erase.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This is a simple wrapper that allows code that is not in the types
hierarchy to visit a data_value.
Will be used by UDF.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This change adds a SCYLLA_REPO_URL argument to Dockerfile, which defines
the RPM repository used to install Scylla from.
When building a new Docker image, users can specify the argument by
passing the --build-arg SCYLLA_REPO_URL=<url> option to the docker build
command. If the argument is not specified, the same RPM repository is
used as before, retaining the old default behavior.
We intend to use this in release engineering infrastructure to specify
RPM repositories for nightly builds of release branches (for example,
3.1.x), which are currently only using the stable RPMs.
Code for check_LT(), check_GT(), etc. will be nearly identical, so
factor it out into a single function that takes a comparator object.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
In 1ca9dc5d47, it was established that the correct way to
base64-decode a JSON value is via string_view, rather than directly
from GetString().
This patch adds a base64_decode(rjson::value) overload, which
automatically uses the correct procedure. It saves typing, ensures
correctness (fixing one incorrect call found), and will come in handy
for future EXPECTED comparisons.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
unwrap_number() is now a public function in serialization.hh instead
of a static function visible only in executor.cc.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Merged patch series from Piotr Sarna:
An otherwise empty partition can still have a valid static column.
Filtering didn't take that fact into account and only filtered
full-fledged rows, which may result in non-matching rows being returned
to the client.
Fixes#5248
"type" label is already in use for the counter type ("derive", "gauge",
etc). Using the same label for "cas" / "non-cas" overwrites it. Let's
instead call the new label "conditional" and use "yes" / "no" for its
value, as suggested by Kostja.
Message-Id: <3082b16e4d6797f064d58da95fb4e50b59ab795c.1572451480.git.vdavydov@scylladb.com>
"
In case when a single reader contributes a stream of fragments and keeps winning over other readers, mutation_reader_merger will enter gallop mode, in which it is assumed that the reader will keep winning over other readers. Currently, a reader needs to contribute 3 fragments to enter that mode.
In gallop mode, fragments returned by the galloping reader will be compared with the best fragment from _fragment_heap. If it wins, the fragment is directly returned. Otherwise, gallop mode ends and merging performed as in general case, which involves heap operations.
In current implementation, when the end of partition is encountered while in gallop mode, the gallop mode is ended unconditionally.
A microbenchmark was added in order to test performance of the galloping reader optimization. A combining reader that merges results from four other readers is created. Each sub-reader provides a range of 32 clustering rows that is disjoint from others. All sub-readers return rows from the same partition. An improvement can be observed after introducing the galloping reader optimization.
As for other benchmarks from the "combined" group, results are pretty close to the old ones. The only one that seems to have suffered slightly is combined.many_overlapping.
Median times from a single run of perf_mutation_readers.combined: (1s run duration, 5 runs per benchmark, release mode)
test name before after improvement
one_row 49.070ns 48.287ns 1.60%
single_active 61.574us 61.235us 0.55%
many_overlapping 488.193us 514.977us -5.49%
disjoint_interleaved 57.462us 57.111us 0.61%
disjoint_ranges 56.545us 56.006us 0.95%
overlapping_partitions_disjoint_rows 127.039us 80.849us 36.36%
Same results, normalized per mutation fragment:
test name before after improvement
one_row 16.36ns 16.10ns 1.60%
single_active 109.46ns 108.86ns 0.55%
many_overlapping 216.97ns 228.88ns -5.49%
disjoint_interleaved 102.15ns 101.53ns 0.61%
disjoint_ranges 100.52ns 99.57ns 0.95%
overlapping_partitions_disjoint_rows 246.38ns 156.80ns 36.36%
Tested on AMD Ryzen Threadripper 2950X @ 3.5GHz.
Tests: unit(release)
Fixes#3593.
"
* '3593-combined_reader-gallop-mode' of https://github.com/piodul/scylla:
mutation_reader: gallop mode microbenchmark
mutation_reader: combined reader gallop tests
mutation_reader: gallop mode for combined reader
mutation_reader: refactor prepare_next
An otherwise empty partition can still have a valid static column.
Filtering didn't take that fact into account and only filtered
full-fledged rows, which may result in non-matching rows being returned
to the client.
Fixes#5248
Update previous results dictionary using the update_metrics method.
It calls metric_source.query_list to get a list of results (similar to discover()) then for each line in the response it updates results dictionary.
New results may be appeneded depending on the do_append parameter (True by default).
Previously, with prometheous, each metric.update called query_list resulting in O(n^2) when all metric were updated, like in the scylla_top dtest - causing test timeout when testing debug build.
(E.g. dtest-debug/216/testReport/scyllatop_test/TestScyllaTop/default_start_test/)
This patch adds "type" label to the following CQL metrics:
inserts
updates
deletes
batches
statements_in_batches
The label is set to "cas" for conditional statements and "non-cas" for
unconditional statements.
Note, for a batch to be accounted as CAS, it is enough to have just one
conditional statement. In this case all statements within the batch are
accounted as CAS as well.
This microbenchmark tests performance of the galloping reader
optimization. A combining reader that merges results from four other
readers is created. Each sub-reader provides a range of 32 clustering
rows that is disjoint from others. All sub-readers return rows from
the same partition. An improvement can be observed after introducing the
galloping reader optimization.
As for other benchmarks from the "combined" group, results are pretty
close to the old ones. The only one that seems to have suffered slightly
is combined.many_overlapping.
Median times from a single run of perf_mutation_readers.combined:
(1s run duration, 5 runs per benchmark, release mode)
test name before after improvement
one_row 49.070ns 48.287ns 1.60%
single_active 61.574us 61.235us 0.55%
many_overlapping 488.193us 514.977us -5.49%
disjoint_interleaved 57.462us 57.111us 0.61%
disjoint_ranges 56.545us 56.006us 0.95%
overlapping_partitions_disjoint_rows 127.039us 80.849us 36.36%
Same results, normalized per mutation fragment:
test name before after improvement
one_row 16.36ns 16.10ns 1.60%
single_active 109.46ns 108.86ns 0.55%
many_overlapping 216.97ns 228.88ns -5.49%
disjoint_interleaved 102.15ns 101.53ns 0.61%
disjoint_ranges 100.52ns 99.57ns 0.95%
overlapping_partitions_disjoint_rows 246.38ns 156.80ns 36.36%
Tested on AMD Ryzen Threadripper 2950X @ 3.5GHz.
In case when a single reader contributes a stream of fragments
and keeps winning over other readers, mutation_reader_merger will
enter gallop mode, in which it is assumed that the reader will keep
winning over other readers. Currently, a reader needs to contribute
3 fragments to enter that mode.
In gallop mode, fragments returned by the galloping reader will be
compared with the best fragment from _fragment_heap. If it wins, the
fragment is directly returned. Otherwise, gallop mode ends and
merging performed as in general case, which involves heap operations.
In current implementation, when the end of partition is encountered
while in gallop mode, the gallop mode is ended unconditionally.
Fixes#3593.
Move out logic responsible for adding readers at partition boundary
into `maybe_add_readers_at_partition_boundary`, and advancing one reader
into `prepare_one`. This will allow to reuse this logic outside
`prepare_next`.
Since seastar::streams are based on future/promise, variadic streams
suffer the same fate as variadic futures - deprecation and eventual
removal.
This patch therefore replaces a variadic stream in commitlog::read_log_file()
with a non-variadic stream, via a helper struct.
Tests: unit (dev)
Recently, scylla memory started to go beyond just providing raw stats
about the occupancy of the various memory pools, to additionally also
provide an overview of the "usual suspects" that cause memory pressure.
As part of this, recently 46341bd63f
added a section of the coordinator stats. This patch continues this
trend and adds a replica section, with the "usual suspects":
* read concurrency semaphores
* execution stages
* read/write operations
Example:
Replica:
Read Concurrency Semaphores:
user sstable reads: 0/100, remaining mem: 84347453 B, queued: 0
streaming sstable reads: 0/ 10, remaining mem: 84347453 B, queued: 0
system sstable reads: 0/ 10, remaining mem: 84347453 B, queued: 0
Execution Stages:
data query stage:
03 "service_level_sg_0" 4967
Total 4967
mutation query stage:
Total 0
apply stage:
03 "service_level_sg_0" 12608
06 "statement" 3509
Total 16117
Tables - Ongoing Operations:
pending writes phaser (top 10):
2 ks.table1
2 Total (all)
pending reads phaser (top 10):
3380 ks.table2
898 ks.table1
410 ks.table3
262 ks.table4
17 ks.table8
2 system_auth.roles
4969 Total (all)
pending streams phaser (top 10):
0 Total (all)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029164817.99865-1-bdenes@scylladb.com>
This patch adds the following per table stats:
cas_prepare_latency
cas_propose_latency
cas_commit_latency
They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed
by Cassandra.
This patch implements accounting of Cassandra's metrics related to
lightweight transactions, namely:
cas_read_latency transactional read latency (histogram)
cas_write_latency transactional write latency (histogram)
cas_read_timeouts number of transactional read timeouts
cas_write_timeouts number of transactional write timeouts
cas_read_unavailable number of transactional read
unavailable errors
cas_write_unavailable number of transactional write
unavailable errors
cas_read_unfinished_commit number of transaction commit attempts
that occurred on read
cas_write_unfinished_commit number of transaction commit attempts
that occurred on write
cas_write_condition_not_met number of transaction preconditions
that did not match current values
cas_read_contention how many contended reads were
encountered (histogram)
cas_write_contention how many contended writes were
encountered (histogram)
Pass contention by reference to begin_and_repair_paxos(), where it is
incremented on every sleep. Rationale: we want to account the total
number of times query() / cas() had to sleep, either directly or within
begin_and_repair_paxos(), no matter if the function failed or succeeded.
Even though every Scylla version has its own scylla-gdb.py, because we
don't backport any fixes or improvements, practically we end up always
using master's version when debugging older versions of Scylla too. This
is made harder by the fact that both Scylla's and its dependencies'
(most notably that of libstdc++ and boost) code is constantly changing
between releases, requiring edits to scylla-gdb.py to make it usable
with past releases.
This patch attempts to make it easier to use scylla-gdb.py with past
releases, more specifically Scylla 3.0. This is achieved by wrapping
problematic lines in a `try: except:` and putting the backward
compatible version in the `except:` clause. These lines have comments
with the version they provide support for, so they can be removed when
said version is not supported anymore.
I did not attempt to provide full coverage, I only fixed up problems
that surfaced when using my favourite commands with 3.0.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029155737.94456-1-bdenes@scylladb.com>
The loop that collects the result of the checksum calculations and logs
any errors. The error logging includes `checksums[0]` which corresponds
to the checksum calculation on the local node. This violates the
assumption of the code following the loop, which assumes that the future
of `checksums[0]` is intact after the loop terminates. However this is
only true when the checksum calculation is successful and is false when
it fails, as in this case the loop extracts the error and logs it. When
the code after the loop checks again whether said calculation failed, it
will get a false negative and will go ahead and attempt to extract the
value, triggering an assert failure.
Fix by making sure that even in the case of failed checksum calculation,
the result of `checksum[0]` is extracted only once.
Fixes: #5238
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029151709.90986-1-bdenes@scylladb.com>
* seastar 2963970f6b...75e189c6ba (7):
> posix-stack: Do auto-resolve of ipv6 scope iff not set for link-local dests
> README.md: Add redpanda and smf to 'Projects using Seastar'
> unix_domain_test: don't assume that at temporary_buffer is null terminated
> socket_address: Use offsetof instead of null pointer
> README: add projects using seastar section to readme
> Adjustments for glibc 2.30 and hwloc 2.0
> Mark future::failed() as const
We may want to change paxos tables format and change internode protocol,
so hide lwt behind experimental flag for now.
Message-Id: <20191029102725.GM2866@scylladb.com>
Currently end of stream validation is done in the destructor,
but the validator may be destructed prematurely, e.g. on
exception, as seen in https://github.com/scylladb/scylla/issues/5215
This patch adds a on_end_of_stream() method explicitly called by
consume_pausable_in_thread. Also, the respective concepts for
ParitionFilter, MutationFragmentFilter and a new on for the
on_end_of_stream method were unified as FlattenedConsumerFilter.
Refs #5215
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 506ff40bd447f00158c24859819d4bb06436c996)
There are a few issues at the CQL layer, because of which the result of
a CAS request execution may differ between Scylla and Cassandra. Mostly,
it happens when static columns are involved. The goal of this patch set
is to fix these issues, thus making Scylla's implementation of CAS yield
the same results as Cassandra's.
Merged patch series by Calle Wilund, with a few fixes by Piotr Jastrzębski:
Adds delta and pre-image data column writes for the atomic columns in a
cdc-enabled table.
Note that in this patch set it is still unconditional. Adding option support
comes in next set.
Uses code more or less derived from alternator to select pre-image, using
raw query interface. So should be fairly low overhead to query generation.
Pre-image and delta mutations are mixed in with the actual modification
mutations to generate the full cdc log (sans post-image).
Even if no rows match clustering key restrictions of a conditional
statement with static columns conditions, we still must include the
static column value into the CAS failure result set. For example,
the following conditional DELETE statement
create table t(k int, c int, s int static, v int, primary key(k, c));
insert into t(k, s) values(1, 1);
delete v from t where k=1 and c=1 if v=1 and s=1;
must return
[applied=False, v=null, s=1]
not just
[applied=False, v=null, s=null]
To fix that, set partition_slice::option::always_return_static_content
for querying rows used for checking conditions so that we have the
static row in update_parameters::prefetch_data even if no regular row
matches clustering column restrictions. Plus modify cas_request::
applies_to() so that it sets is_in_cas_result_set flag for the static
row in case there are static column conditions, but the result set
happens to be empty.
As pointed out by Tomek, there's another reason to set partition_slice::
option::always_return_static_content apart from building a correct
result set on CAS failure. There could be a batch with two statements,
one with clustering key restrictions which select no row, and another
statement with only static column conditions. If we didn't enable this
flag, we wouldn't get a static row even if it exists, and static column
conditions would evaluate as if the static row didn't exist, for
example, the following batch
create table t(k int, c int, s int static, primary key(k, c));
insert into t(k, s) values(1, 1);
begin batch
insert into t(k, c) values(1, 1) if not exists
update t set s = 2 where k = 1 if s = 1
apply batch;
would fail although it clearly must succeed.
A SELECT statement that has clustering key restrictions isn't supposed
to return static content if no regular rows matches the restrictions,
see #589. However, for the CAS statement we do need to return static
content on failure so this patch adds a flag that allows the caller to
override this behavior.
Apart from conditional statements, there may be other reading statements
in a batch, e.g. manipulating lists. We must not include rows fetched
for them into the CAS result set. For instance, the following CAS batch:
create table t(p int, c int, i int, l list<int>, primary key(p, c));
insert into t(p, c, i) values(1, 1, 1)
insert into t(p, c, i, l) values(1, 1, 1, [1, 2, 3])
begin batch
update t set i=3 where p=1 and c=1 if i=2
update t set l=l-[2] where p=1 and c=2
apply batch;
is supposed to return
[applied] | p | c | i
----------+---+---+---
False | 1 | 1 | 1
not
[applied] | p | c | i
----------+---+---+---
False | 1 | 1 | 1
False | 1 | 2 | 1
To filter out such collateral rows from the result set, let's mark rows
checked by conditional statements with a special flag.
If a CQL statement only updates static columns, i.e. has no clustering
key restrictions, we still fetch a regular row so that we can check it
against EXISTS condition. In this case we must be especially careful: we
can't simply pass the row to modification_statement::applies_to, because
it may turn out that the row has no static columns set, i.e. there's no
in fact static row in the partition. So we filter out such rows without
static columns right in cas_request::applies_to before passing them
further to modification_statement::applies_to.
Example:
create table t(p int, c int, s int static, primary key(p, c));
insert into t(p, c) values(1, 1);
insert into t(p, s) values(1, 1) if not exists;
The conditional statement must succeed in this case.
In case a CQL statement has only static columns conditions, we must
ignore clustering key restrictions.
Example:
create table t(p int, c int, s int static, v int, primary key(p, c));
insert into t(p, s) values(1, 1);
update t set v=1 where p=1 and c=1 if s=1;
This conditional statement must successfully insert row (p=1, c=1, v=1)
into the table even though there's no regular row with p=1 and c=1 in
the table before it's executed, because the statement condition only
applies to the static column s, which exists and matches.
If a modification statement doesn't have a clustering column restriction
while the table has static columns, then EXISTS condition just needs to
check if there's a static row in the partition, i.e. it doesn't need to
select any regular rows. Let's treat such EXIST condition like a static
column condition so that we can ignore its clustering key range while
checking CAS conditions.
This will allow us to add helper methods and store extra info in each
row. For example, we can add a method for checking if a row has static
columns. Also, to build CAS result set, we need to differentiate rows
fetched to check conditions from those fetched for reading operations.
Using struct as row container will allow us to store this information in
each prefetched row.
Currently, we set _sets_regular_columns/_sets_static_columns flags when
adding regular/static conditions to modification_statement. We use them
in applies_only_to_static_columns() function that returns true iff
_sets_static_columns is set and _sets_regular_columns is clear. We
assume that if this function returns true then the statement only deals
with static columns and so must not have clustering key restrictions.
Usually, that's true, but there's one exception: DELETE FROM ...
statement that deletes whole rows. Technically, this statement doesn't
have any column operations, i.e. _sets_regular_columns flag is clear.
So if such a statement happens to have a static condition, we will
assume that it only applies to static columns and mistakenly raise an
error.
Example:
create table t(k int, c int, s int static, v int, primary key(k, c));
delete from t where k=1 and c=1 if s=1;
To fix this, let's not set the above mentioned flags when adding
conditions and instead check if _column_conditions array is empty in
applies_only_to_static_columns().
modification_statement::process_where_clause() assumes that both
operations and conditions has been added to the statement when it's
called: it uses this information to raise an error in case the statement
restrictions are incompatible with operations or conditions. Currently,
operations are set before this function is called, but not conditions.
This results in "Invalid restrictions on clustering columns since
the {} statement modifies only static columns" error while trying to
execute the following statements:
create table t(k int, c int, s int static, v int, primary key(k, c));
delete s from t where k=1 and c=1 if v=1;
update t set s=1 where k=1 and c=1 if v=1;
Fix this by always initializing conditions before processing WHERE
clause.
Print a histogram of the number of async work items in the shard's
outgoing smp queues.
Example:
(gdb) scylla smp-queues
10747 17 -> 3 ++++++++++++++++++++++++++++++++++++++++
721 17 -> 19 ++
247 17 -> 20 +
233 17 -> 10 +
210 17 -> 14 +
205 17 -> 4 +
204 17 -> 5 +
198 17 -> 16 +
197 17 -> 6 +
189 17 -> 11 +
181 17 -> 1 +
179 17 -> 13 +
176 17 -> 2 +
173 17 -> 0 +
163 17 -> 8 +
1 17 -> 9 +
Useful for identifying the target shard, when `scylla task_histogram`
indicates a high number of async work items.
To produce the histogram the command goes over all virtual objects in
memory and identifies the source and target queues of each
`seastar::smp_message_queue::async_work_item` object. Practically the
source queue will always be that of the current shard. As this scales
with the number of virtual objects in memory, it can take some time to
run. An alternative implementation would be to instead read the actual
smp queues, but the code of that is scary so I went for the simpler and
more reliable solution.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191028132456.37796-1-bdenes@scylladb.com>
This patch set introduces light-weight transactions support to
ScyllaDB. It is a subset of the full series, which adds
basic LWT support and which has been reviewed thus far.
"
mutation_test/test_udt_mutations kept failing on my machine and I tracked it down to the 3rd patch in this series (use int64_t constants for long_type). While at it, this series also fixes a comment and the end iterator in BOOST_REQUIRE(std::all_of(...))
mutation_test: test_udt_mutations: fixup udt comment
mutation_test: test_udt_mutations: fix end iterator in call to std::all_of
mutation_test: test_udt_mutations: use int64_t constants for long_type
Test: mutation_test(dev, debug)
"
* 'test_udt_mutations-fixes' of https://github.com/bhalevy/scylla:
mutation_test: test_udt_mutations: use int64_t constants for long_type
mutation_test: test_udt_mutations: fix end iterator in call to std::all_of
mutation_test: test_udt_mutations: fixup udt comment
Based on a mutation, creates a pre-image select operation.
Note, this uses raw proxy query to shortcut parsing etc,
instead of trying to cache by generated query. Hypothesis is that
this is essentially faster.
The routine assumes all rows in a mutation touch same static/regular
columns. If this is not always true it will need additional
calculations.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Support single-statement conditional updates and as well as batches.
This patch almost fully rewrites column_condition.cc, implementing
is_satisfied_by().
Most of the remaining complications in column_condition implementation
come from the need to properly handle frozen and multi-cell
collection in predicates - up until now it was not possible
to compare entire collection values between each other. This is further
complicated since multi-cell lists and sets are returned as maps.
We can no longer assume that the columns fetched by prefetch operation
are non-frozen collections. IF EXISTS/IF NOT EXISTS condition
fetches all columns, besides, a column may be needed to check other
condition.
When fetching the old row for LWT or to apply updates on list/columns,
we now calculate precisely the list of columns to fetch.
The primary key columns are also included in CAS batch result set,
and are thus also prefetched (the user needs them to figure out which
statements failed to apply).
The patch is cross-checked for compatibility with cassandra-3.11.4-1545-g86812fa502
but does deviate from the origin in handling of conditions on static
row cells. This is addressed in future series.
Each column_condition and raw::column_condition construction case had a
static method wrapping its constructor, simply supplying some defaults.
This neither improves clarity nor maintainability.
cql_statement_opt_metadata is an interim node
in cql (prepared) statement hierarchy parenting
modification_statement and batch_statement. If there
is IF condition in such statements, they return a result set,
and thus have a result set metadata.
The metadata itself is filled in a subsequent patch.
Add checks for conditional modification statement limitations:
- WHERE clustering_key IN (list) IF condition is not supported
since a conditions is evaluated for a single row/cell, so
allowing multiple rows to match the WHERE clause would create
ambiguity,
- the same is true for conditional range deletions.
- ensure all clustering restrictions are eq for conditional delete
We must not allow statements like
create table t(p int, c int, v int, primary key (p, c));
delete from t where p=1 and c>0 if v=1;
because there may be more than one statement in a partition satisfying
WHERE clause, in which case it's unclear which of them should satisfy
IF condition: all or just one.
Raising an error on such a statement is consistent with Cassandra's
behavior.
Introduce service::cas_request abstract base class
which can be used to parameterize Paxos logic.
Implement storage_proxy::cas() - compare and swap - the storage proxy
entry point for lightweight transactions.
Currently the code that manipulates mutations during write need to
check what kind of mutations are those and (sometimes) choose different
code paths. This patch encapsulates the differences in virtual
functions of mutation_holder object, so that high level code will not
concern itself with the details. The functions that are added:
apply_locally(), apply_remotely() and store_hint().
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
Paxos protocol has three stages: prepare, accept, learn. This patch adds
rpc verb for each of those stages. To be term compatible with Cassandra
the patch calls those stages: prepare, propose, commit.
Paxos protocol relies on replicas having a state that persists over
crashes/restarts. This patch defines such state and stores it in the
database itself in the paxos table to make it persistent.
The stored state is:
in_progress_ballot - promised ballot
proposal - accepted value
proposal_ballot - the ballot of the accepted value
most_recent_commit - most recently learned value
most_recent_commit_at - the ballot of the most recently learned value
This patch add two data structures that will be used by paxos. First
one is "proposal" which contains a ballot and a mutation representing
a value paxos protocol is trying to set. Second one is
"prepare_response" which is a value returned by paxos prepare stage.
It contains currently accepted value (if any) and most recently
learned value (again if any). The later is used to "repair" replicas
that missed previous "learn" message.
Otherwise they are decomposed and serialized as 4-byte int32.
For example, on my machine cell[1] looked like this:
{0002, atomic_cell{0000000310600000;ts=0;expiry=-1,ttl=0}}
and it failed cells_equal against:
{0002, atomic_cell{0000000300000000;ts=0;expiry=-1,ttl=0}}
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
server::set_routes() was setting the value of server::_callbacks.
This led to a race condition, as set_routes() is invoked on every
shard simultaneously. It is also unnecessary, since _callbacks can be
initialized in the constructor.
Fixes#5220.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
Introduce the traced_file class which wraps a file, adding CQL trace messages before and after every operation that returns a future.
Use this file to trace reads from SSTable data and index files.
Fixes#4908.
"
* 'traced_file' of https://github.com/kbr-/scylla:
sstables: report sstable index file I/O in CQL tracing
sstables: report sstable data file I/O in CQL tracing
tracing: add traced_file class
"
This change allows creating tables with non-frozen UDT columns. Such columns can then have single fields modified or deleted.
I had to do some refactoring first. Please read the initial commit messages, they are pretty descriptive of what happened (read the commits in the order they are listed on my branch: https://github.com/kbr-/scylla/commits/udt, starting from kbr-@8eee36e, in order to understand them). I also wrote a bunch of documentation in the code.
Fixes#2201.
"
* 'udt' of https://github.com/kbr-/scylla: (64 commits)
tests: too many UDT fields check test
collection_mutation: add a FIXME.
tests: add a non-frozen UDT materialized view test
tests: add a UDT mutation test.
tests: add a non-frozen UDT "JSON INSERT" test.
tests: add a non-frozen UDT to for_each_schema_change.
tests: more non-frozen UDT tests.
tests: move some UDT tests from cql_query_test.cc to new file.
types: handle trailing nulls in tuples/UDTs better.
cql3: enable deleting single fields of non-frozen UDTs.
cql3: enable setting single fields of a non-frozen UDT.
cql3: enable non-frozen UDTs.
cql3: introduce user_types::marker.
cql3: generalize function_call::make_terminal to UDTs.
cql3: generalize insert_prepared_json_statement::execute_set_value to UDTs.
cql3: use a dedicated setter operation for inserting user types.
cql3: introduce user_types::value.
types: introduce to_bytes_opt_vec function.
cql3: make user_types::delayed_value::bind_internal return vector<bytes_opt>.
cql3: make cql3_type::raw_ut::to_string distinguish frozenness.
...
The health check is performed simply by issuing a GET request
to the alternator port - it returns the following status 200
response when the server is healthy:
$ curl -i localhost:8000
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 23
Server: Seastar httpd
Date: 21 Oct 2019 12:55:33 GMT
healthy: localhost:8000
This commit comes with a test.
Fixes#5050
Message-Id: <3050b3819661ee19640c78372e655470c1e1089c.1571921618.git.sarna@scylladb.com>
We could use iterators over cells instead of a vector of cells
in collection_mutation(_view)_description. Then some use cases could
provide iterators that construct the cells "on the fly".
Comparing user types after adding new fields was bugged.
In the following scenario:
create type ut (a int);
create table cf (a int primary key, b frozen<ut>);
insert into cf (a, b) values (0, (0));
alter type ut add b int;
select * from cf where b = {a:0,b:null};
the row with a = 0 should be returned, even though the value stored in the database is shorter
(by one null) than the value given by the user. Until now it wouldn't
have.
Add a cluster feature for non-frozen UDTs.
If the cluster supports non-frozen UDTs, do not return an error
message when trying to create a table with a non-frozen user type.
cql3::user_types::marker is a dedicated cql3::abstract_marker for user
type placeholders in prepared CQL queries. When bound, it returns a
user_types::value.
Previously it returned vector<cql3::raw_value>, even though we don't use
unset values when setting a UDT value (fields that are not provided
become nulls. Thats how C* does it).
This simplifies future implementation of user_types::{value, setter}.
is_value_compatible_with_internal and update_user_type were generalized
to the non-frozen case.
For now, all user_type_impls in the code are non-multi-cell (frozen).
This will be changed in future commits.
These functions are used to translate field indices, which are used to
identify fields inside UDTs, from/to a serialized representation to be
stored inside sstables and mutations.
They do it in a way that is compatible with C*.
The purpose of collection_type_impl::to_value was to serialize a
collection for sending over CQL. The corresponding function in origin
is called serializeForNativeProtocol, but the name is a bit lengthy,
so I settled for serialize_for_cql.
The method now became a free-standing function, using the visit
function to perform a dispatch on the collection type instead
of a virtual call. This also makes it easier to generalize it to UDTs
in future commits.
Remove the old serialize_for_native_protocol with a FIXME: implement
inside. It was already implemented (to_value), just called differently.
remove dead methods: enforce_limit and serialized_values. The
corresponding methods in C* are auxiliary methods used inside
serializeForNativeProtocol. In our case, the entire algorithm
is wholly written in serialize_for_cql.
`collection_type_impl::serialize_mutation_form`
became `collection_mutation(_view)_description::serialize`.
Previously callers had to cast their data_type down to collection_type
to use serialize_mutation_form. Now it's done inside `serialize`.
In the future `serialize` will be generalized to handle UDTs.
`collection_type_impl::deserialize_mutation_form`
became a free standing function `deserialize_collection_mutation`
with similiar benefits. Actually, noone needs to call this function
manually because of the next paragraph.
A common pattern consisting of linearizing data inside a `collection_mutation_view`
followed by calling `deserialize_mutation_form` has been abstracted out
as a `with_deserialized` method inside collection_mutation_view.
serialize_mutation_form_only_live was removed,
because it hadn't been used anywhere.
collection_type_impl::mutation became collection_mutation_description.
collection_type_impl::mutation_view became collection_mutation_view_description.
These classes now reside inside collection_mutation.hh.
Additional documentation has been written for these classes.
Related function implementations were moved to collection_mutation.cc.
This makes it easier to generalize these classes to non-frozen UDTs in future commits.
The new names (together with documentation) better describe their purpose.
The classes 'collection_mutation' and 'collection_mutation_view'
were moved to a separate header, collection_mutation.hh.
Implementations of functions that operate on these classes,
including some methods of collection_type_impl, were moved
to a separate compilation unit, collection_mutation.cc.
This makes it easier to modify these structures in future commits
in order to generalize them for non-frozen User Defined Types.
Some additional documentation has been written for collection_mutation.
Similar to "gossip: Limit number of pending gossip ACK messages", limit
the number of pending gossip ACK2 messages in gossiper::handle_ack_msg.
Fixes#5210
In a cross-dc large cluster, the receiver node of the gossip SYN message
might be slow to send the gossip ACK message. The ack messages can be
large if the payload of the application state is big, e.g.,
CACHE_HITRATES with a lot of tables. As a result, the unlimited ACK
message can consume unlimited amount of memory which causes OOM
eventually.
To fix, this patch queues the SYN message and handles it later if the
previous ACK message is still being sent. However, we only store the
latest SYN message. Since the latest SYN message from peer has the
latest information, so it is safe to drop the previous SYN message and
keep the latest one only. After this patch, there can be at most 1
pending SYN message and 1 pending ACK message per peer node.
Fixes#5210
Merged patch series from Piotr Sarna:
This series couples system_auth.roles with authorization routines
in alternator. The `salted_hash` field, which is every user's hashed
password, is used as a secret key for the signature generation
in alternator.
This series also adds related expiration verifications for alternator
signatures.
It also comes with more test cases and docs updates.
Tests: alternator(local, remote), manual
Piotr Sarna (11):
alternator: add extracting key from system_auth.roles
alternator: futurize verify_signature function
alternator: move the api handler to a separate function
alternator: use keys from system_auth.roles for authorization
alternator: add key cache to authorization
alternator-test: add a wrong password test
alternator: verify that the signature has not expired
alternator: add additional datestamp verification
alternator-test: add tests for expired signatures
docs: update alternator entry for authorization
alternator-test: add authorization to README
alternator-test/conftest.py | 2 +-
alternator-test/test_authorization.py | 44 ++++++++-
alternator-test/test_describe_endpoints.py | 2 +-
alternator/auth.hh | 15 ++-
alternator/server.hh | 10 +-
alternator/auth.cc | 62 +++++++++++-
alternator/server.cc | 106 ++++++++++++---------
alternator-test/README.md | 28 ++++++
docs/alternator/alternator.md | 7 +-
9 files changed, 221 insertions(+), 55 deletions(-)
ommit 93270dd changed gc_clock to be 64-bit, to fix the Y2038
problem. While 64-bit tombstone::deletion_time is serialized in a
compatible way, TTLs (gc_clock::duration) were not.
This patchset reverts TTL serialization to the 32-bit serialization
format, and also allows opting-in to the 64-bit format in case a
cluster was installed with the broken code. Only Scylla 3.1.0 is
vulnerable.
Fixes#4855
Tests: unit (dev)
From Shlomi:
4 node cluster Node A, B, C, D (Node A: seed)
cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node>
cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node>
while read is progressing
Node D: nodetool decommission
Node A: nodetool status node - wait for UL
Node A: nodetool cleanup (while decommission progresses)
I get the error on c-s once decommission ends
java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated
The problem is when a node gets new ranges, e.g, the bootstrapping node, the
existing nodes after a node is removed or decommissioned, nodetool cleanup will
remove data within the new ranges which the node just gets from other nodes.
To fix, we should reject the nodetool cleanup when there is pending ranges on that node.
Note, rejecting nodetool cleanup is not a full protection because new ranges
can be assigned to the node while cleanup is still in progress. However, it is
a good start to reject until we have full protection solution.
Refs: #5045
Scylla 3.1.0 broke the serialization format for TTLs. Later versions
corrected it, but if a cluster was originally installed as 3.1.0,
it will use the broken serialization forever. This configuration option
allows upgrades from 3.1.0 to succeed, by enabling the broken format
even for later versions.
Scylla 3.1.0 inadvertently changed the serialization format of TTLs
(internally represented as gc_clock::duration) from 32-bit to 64-bit,
as part of preparation for Y2038 (which comes earlier for TTLed cells).
This breaks mutations transported in a mixed cluster.
To fix this, we revert back to the 32-bit format, unless we're in a 3.1.0-
heritage cluster, in which case we use the 64-bit format. Overflow of
a TTL is not a concern, since TTLs are capped to 20 years by the TTL layer.
An assertion is added to verify this.
This patch only defines a variable to indicate we're in
a 3.1.0 heritage cluster, but a way to set it is left to
a later patch.
* seastar 6bcb17c964...2963970f6b (4):
> Merge "IPv6 scope support and network interface impl" from Calle
> noncopyable_function: do not copy uninitialized data
> Merge "Move smp and smp queue out of reactor" from Asias
> Consolidate posix socket implementations
The README paragraph informs about turning on authorization with:
alternator-enforce-authorization: true
and has a short note on how to set up the secret key for tests.
The first test case ensures that expired signatures are not accepted,
while the second one checks that signatures with dates that reach out
too far into the future are also refused.
The authorization signature contains both a full obligatory date header
and a shortened datestamp - an additional verification step ensures that
the shortened stamp matches the full date.
AWS signatures have a 15min expiration policy. For compatibility,
the same policy is applied for alternator requests. The policy also
ensures that signatures expanding more than 15 minutes into the future
are treated as unsafe and thus not accepted.
The additional test case submits a request as a user that is expected
to exist (in the local setup), but the provided password is incorrect.
It also updates test_wrong_key_access so it uses an empty string
for trying to authenticate as an inexistent user - in order to cover
more corner cases.
In order to avoid fetching keys from system_auth.roles system table
on every request, a cache layer is introduced. And in order not to
reinvent the wheel, the existing implementation of loading_cache
with max size 1024 and a 1 minute timeout is used.
Instead of having a hardcoded secret key, the server now verifies
an actual key extracted from system_auth.roles system table.
This commit comes with a test update - instead of 'whatever':'whatever',
the credentials used for a local run are 'alternator':'secret_pass',
which matches the initial contents of system_auth.roles table,
which acts as a key store.
Fixes#5046
The lambda used for handling the api request has grown a little bit
too large, so it's moved to a separate method. Along with it,
the callbacks are now remembered inside the class itself.
The verify_signature utility will later be coupled with Scylla
authorization. In order to prepare for that, it is first transformed
into a function that returns future<>, and it also becomes a member
of class server. The reason it becoming a member function is that
it will make it easier to implement a server-local key cache.
As a first step towards coupling alternator authorization with Scylla
authorization, a helper function for extracting the key (salted_hash)
belonging to the user is added.
From Shlomi:
4 node cluster Node A, B, C, D (Node A: seed)
cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node>
cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node>
while read is progressing
Node D: nodetool decommission
Node A: nodetool status node - wait for UL
Node A: nodetool cleanup (while decommission progresses)
I get the error on c-s once decommission ends
java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated
The problem is when a node gets new ranges, e.g, the bootstrapping node, the
existing nodes after a node is removed or decommissioned, nodetool cleanup will
remove data within the new ranges which the node just gets from other nodes.
To fix, we should reject the nodetool cleanup when there is pending ranges on that node.
Note, rejecting nodetool cleanup is not a full protection because new ranges
can be assigned to the node while cleanup is still in progress. However, it is
a good start to reject until we have full protection solution.
Refs: #5045
Argument evaluation order is UB, so it's not guaranteed that
c->make_garbage_collected_sstable_writer() is called before
compaction is moved to run().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191023052647.9066-1-raphaelsc@scylladb.com>
Make it possible to compare multi-cell lists and sets serialized
as maps with literal values and serialize them to network using
a standard format (vector of values).
This is a pre-requisite patch for column condition evaluation
in light-weight transactions.
Merged patch series from Botond Dénes:
This series extends the existing docs/debugging.md with a detailed guide
on how to debug Scylla coredumps. The intended target audience is
developers who are debugging their first core, hence the level of
details (hopefully enough). That said this should be just as useful for
seasoned debuggers just quickly looking up some snippet they can't
remember exactly. A Throubleshooting chapter is also added in this
series for commonly-met problems.
I decided to create this guide after myself having struggled for more
than a day on just opening(!) a coredump that was produced on Ubuntu.
As my main source, I used the How-to-debug-a-coredump page from the
internal wiki which contains many useful information on debugging
coredumps, however I found it to be missing some crucial information, as
well being very terse, thus being primarily useful for experienced
debuggers who can fill in the blanks. The reason I'm not extending said
wiki page is that I think this information should not be hidden in some
internal wiki page. Also, docs/debugging.md now seems to be a much
better base for such a document. This document was started as a
comprehensive debugging manual for beginners (but not just).
You will notice that the information on how to debug cores from
CentOS/Redhat are quite sparse. This is because I have no experience
with such cores, so for now the respective chapters are just stubs. I
intend to complete them in the future after having gained the necessary
experience and knowledge, however those being in possession of said
knowledge are more then welcome to send a patch. :)
Botond Dénes (4):
docs/debugging.md: demote 'Starting GDB' and 'Using GDB'
docs/debugging.md: fix formatting issues
docs/debugging.md: add 'Debugging coredumps' subchapter
docs/debugging.md: add 'Throubleshooting' subchapter
docs/debugging.md | 240 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 228 insertions(+), 12 deletions(-)
Add lua as a dependency in preparation for UDF. This is the first
patch since it has to go in before to allow for a frozen toolchain
update.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
[avi: update frozen toolchain image]
Message-Id: <20191018231442.11864-2-espindola@scylladb.com>
Multi-cell lists and maps may be stored in different formats: as sorted
vectors of pairs of values, when retreived from storage, or as sorted
vectors of values, when created from parser literals or supplied as
parameter values.
Implement a specialized compare for use when receiver and paramter
representation don't match.
Add helpers.
The problem is that backlog tracker is not being updated properly after
incremental compaction.
When replacing sstables earlier, we tell backlog tracker that we're done
with exhausted sstables[1], but we *don't* tell it about the new, sealed
sstables created that will replace the exhausted ones.
[1]: exhausted sstable is one that can be replaced earlier by compaction.
We need to notify backlog tracker about every sstable replacement which
was triggered by incremental compaction.
Otherwise, backlog for a table that enables incremental compaction will
be lower than it actually should. That's because new sstables being
tracked as partial decrease the backlog, whereas the exhausted ones
increase it.
The formula for a table's backlog is basically:
backlog(sstable set + compacting(1) - partial(2))
(1) compacting includes all compaction's input sstables, but the
exhausted ones are removed from it (correct behavior).
(2) partial includes all compaction's output sstables, but the ones
that replaced the exhausted sstables aren't removed from it (incorrect
behavior).
This problem is fixed by making backlog track *fully* aware of the early
replacement, not only the exhausted sstables, but also the new sstables
that replaced the exhausted ones. The new sstables need to be moved
inside the tracker from partial state to the regular one.
Fixes#5157.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191016002838.23811-1-raphaelsc@scylladb.com>
Rather than passing a pointer to a cql_stats member corresponding to
the statement type, pass a reference to a cql_stats object and use
statement_type, which is already stored in modification_statement, for
determining which counter to increment. This will allow us to account
conditional statements, which will have a separate set of counters,
right in modification_statement::execute() - all we'll need to do is
add the new counters and bump them in case execute_with_condition is
called.
While we are at it, remove extra inclusions from statement_type.hh so as
not to introduce any extra dependencies for cql_stats.hh users.
Message-Id: <20191022092258.GC21588@esperanza>
Merged patch series from Avi Kivity:
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by allocating it as an external object rather
than inlined into mutation_partition. This adds overhead when the
static row is present (17 bytes for the reference, back reference,
and lsa allocator overhead).
perf_simple_query appears to marginally (2%) faster. Footprint is
reduced by ~9% for a cache entry, 12% in memtables. More details are
provided in the patch commitlog.
Tests: unit (debug)
Avi Kivity (4):
managed_ref: add get() accessor
managed_ref: add external_memory_usage()
mutation_partition: introduce lazy_row
mutation_partition: make static_row optional to reduce memory
footprint
cell_locking.hh | 2 +-
converting_mutation_partition_applier.hh | 4 +-
mutation_partition.hh | 284 ++++++++++++++++++++++-
partition_builder.hh | 4 +-
utils/managed_ref.hh | 12 +
flat_mutation_reader.cc | 2 +-
memtable.cc | 2 +-
mutation_partition.cc | 45 +++-
mutation_partition_serializer.cc | 2 +-
partition_version.cc | 4 +-
tests/multishard_mutation_query_test.cc | 2 +-
tests/mutation_source_test.cc | 2 +-
tests/mutation_test.cc | 12 +-
tests/sstable_mutation_test.cc | 10 +-
14 files changed, 355 insertions(+), 32 deletions(-)
"
The node startup code (in particular the functions storage_service::prepare_to_join and storage_service::join_token_ring) is complicated and hard to understand.
This patch set aims to simplify it at least a bit by removing some dead code, moving code around so it's easier to understand and adding some comments that explain what the code does.
I did it to help me prepare for implementing generation and gossiping of CDC streams.
"
* 'bootstrap-refactors' of https://github.com/kbr-/scylla:
storage_service: more comments in join_token_ring
db: remove system_keyspace::update_local_tokens
db: improve documentation for update_tokens and get_saved_tokens in system_keyspace
storage_service: remove storage_service::_is_bootstrap_mode.
storage_service: simplify storage_service::bootstrap method
storage_service: fix typo in handle_state_moving
storage_service: remove unnecessary use of stringstream
storage_service: remove redundant call to update_tokens during join_token_ring
storage_service: remove storage_service::set_tokens method.
storage_service: remove is_survey_mode
storage_service::handle_state_normal: tokens_to_update* -> owned_tokens
storage_service::handle_state_normal: remove local_tokens_to_remove
db::system_keyspace::update_tokens: take tokens by const ref
db::system_keyspace::prepare_tokens: make static, take tokens by const ref
token_metadata::update_normal_tokens: take tokens by const ref
Assume n1 and n2 in a cluster with generation number g1, g2. The
cluster runs for more than 1 year (MAX_GENERATION_DIFFERENCE). When n1
reboots with generation g1' which is time based, n2 will see
g1' > g2 + MAX_GENERATION_DIFFERENCE and reject n1's gossip update.
To fix, check the generation drift with generation value this node would
get if this node were restarted.
This is a backport of CASSANDRA-10969.
Fixes#5164
The flag did nothing. It was used in one place to check if there's a
bug, but it can easily by proven by reading the code that the check
would never pass.
The storage_service::bootstrap method took a parameter: tokens to
bootstrap with. However, this method is only called in one place
(join_token_ring) with only one parameter: _bootstrap_tokens. It doesn't
make sense to call this method anywhere else with any other parameter.
This commit also adds a comment explaining what the method does and
moves it into the private section of storage_service.
When a non-seed node was bootstrapping, system_keyspace::update_tokens
was called twice: first right after the tokens were generated (or
received if we were replacing a different node) in the call to
`bootstrap`, and then later in join_token_ring. The second call was
redundant.
The join_token_ring call was also redundant if we were not bootstrapping
and had tokens saved previously (e.g. when restarting). In that case we
would have read them from LOCAL and then save the same tokens again.
This commit removes the redundant call and inserts calls to
update_tokens where they are necessary, when new tokens are generated.
The aim is to make the code easier to understand.
It also adds a comment which explains why the tokens don't need to be
generated in one of the cases.
After commit 36ccf72f3c, this method
was used only in one place.
Its name did not make it obvious what it does and when is it safe to call it.
This commit pulls out the code from set_tokens to the point where it was
called (join_token_ring). The code is only possible to understand in
context.
This code was also saving the tokens to the LOCAL table before
retrieving them from this table again. There is no point in doing that:
1. there are no races, since when join_token_ring is running, it is the
only function which can call system_keyspace::update_tokens (which saves them to the
LOCAL table). There can be no multiple instances of join_token_ring.
2. Even if there was a race, this wouldn't fix anything. The tokens we
retrieve from LOCAL by calling get_local_tokens().get0() could already
be different in the LOCAL table when the get0() returns.
Replace the two variables:
tokens_to_update_in_metadata
tokens_to_update_in_system_keyspace
which were exactly the same, with one variable owned_tokens.
The new name describes what the variable IS instead what's it used for.
Add a comment to clarify what "owned" means: those are the tokens the
node chose and any collision was resolved positively for this node.
Move the variable definition further down in the code, where it's
actually needed.
Merged patch series from Piotr Sarna:
Calculating the select statement for given view_info structure
used to work fine, but once local indexes were introduced, a subtle
bug appeared: the legacy token column does not exist in local indexes
and a valid clustering key column was omitted instead.
That results in potentially incorrect partition slices being used later
in read-before-write.
There's a long term plan for removing select_statement from
view info altogether, but nonetheless the bug needs to be fixed first.
Branch: master, 3.1
Tests: unit(dev) + manual confirmation that a correct legacy column is picked
Merge a patch series from Piotr Jastrzębski (haaawk):
This PR introduces CDC in it's minimal version.
It is possible now to create a table with CDC enabled or to enable/disable
CDC on existing table. There is a management of CDC log and description
related to enabling/disabling CDC for a table.
For now only primary key of the changed data is logged.
To be able to co-locate cdc streams with related base table partitions it
was needed to propagate the information about the number of shards per node.
This was node through gossip.
There is an assumption that all the nodes use the same value for
sharding_ignore_msb_bits. If it does not hold we would have to gossip
sharding_ignore_msb_bits around together with the number of shards.
Fixes#4986.
Tests: unit(dev, release, debug)
Currently, the function that generates the graph edges (and vertices)
with a breadth-first traversal of the object graph accidentally uses the
object that is the starting point of the graph as the "to" part of each
edge. This results in the graph having each of its edges point to the
starting point, as if all objects in it referenced said object directly.
Fix by using the object of the currently examined object.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191018113019.95093-1-bdenes@scylladb.com>
To the 'Debuggin Scylla with GDB` chapter. The '### Debugging
relocatable binaries built with the toolchain' subchapter is demoted to
be just a section in this new subchapter. It is also renamed to
'Relocatable binaries'.
This subchapter intends to be a complete guide on how to debug coredumps
from how to obtain the correct version of all the binaries all the way
to how to correctly open the core with GDB.
* seastar e888b1df...6bcb17c9 (4):
> iotune: don't crash in sequential read test if hitting EOF
> Remove FindBoost.cmake from install files
> Merge "Move reactor backend out of reactor" from Asias
> fair_queue: Add fair_queue.cc
This patch wrapps announce_migration logic into a lambda
that will be used both when cdc is used and when it's not.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment, this test only checks that table
creation and alteration sets cdc_options property
on a table correctly.
Future patches will extend this test to cover more
CDC aspects.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We would like to share with other nodes
the value of ignore_msb_bits property used by the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know ignore_msb_bits property value for each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We would like to share with other nodes
the number of shards available at the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know how many shards are on each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Refactor modification_statement to enable lightweight
transaction implementation.
This patch set re-arranges logic of
modification_statement::get_mutations() and uses
a column mask of identify the columns to prefetch.
It also pre-computes a few modification statement properties
at prepare, assuming the prepared statement is invalidated if
the underlying schema changes.
They are used more extensively with introduction of lightweight
transactions, and pre-computing makes it easier to reason about
complexity of the scenarios where they are involved.
Pre-compute column mask of columns to prefetch when preparing
a modification statement and use it to build partition_slice
object for read command. Fetch only the required columns.
Ligthweight transactions build up on this by using adding
columns used in conditions and in cas result set to the column
maks of columns to read. Batch statements unite all column
masks to build a single relation for all rows modified by
conditional statements of a batch.
Refactor get_mutations() so that the read command and
apply_updates() functions can be used in lightweight transactions.
Move read_command creation to an own method, as well as apply_updates().
Rewrite get_mutations() using the new API.
Avoid unnecessary shared pointers.
Introduce a column definition ordinal_id and use it in boosted
update_parameters::prefetch_data as a column index of a full row.
Lightweight transactions prefetch data and return a result set.
Make sure update_parameters::prefetch_data can serve as a
single representation of prefetched list cells as well as
condition cells and as a CAS result set.
I have a lot of plans for column_definition::ordinal_id, it
simplifies a lot of operations with columns and will also be
used for building a bitset of columns used in a query
or in multiple queries of a batch.
In modification_statement/batch_statement, we need to prefetch data to
1) apply list operations
2) evaluate CAS conditions
3) return CAS result set.
Boost update_parameters::prefetch_data to serve as a single result set
for all of the above. In case of a batch, store multiple rows for
multiple clustering keys involved in the batch.
Use an ordered set for columns and rows to make sure 3) CAS result set
is returned to the client in an ordered manner.
Deserialize the primary key and add it to result set rows since
it is returned to the client as part of CAS result set.
Index columns using ordinal_id - this allows having a single
set for all columns and makes columns easy to look up.
Remove an extra memcpy to build view objects when looking
up a cell by primary key, use partition_key/clustering_key
objects for lookup.
Get rid of an unnecessary optional around
update_parameters::prefetch_data.
update_parameters won't own prefetch_data in the future anyway,
since prefetch_data can be shared among multiple modification
statements of a batch, each statement having its own options
and hence its own update_parameters instance.
Move prefetch_data_builder class from modification_statement.cc
to update_parameters.cc.
We're going to share the same builder to build a result set
for condition evaluation and to apply updates of batch statements, so we
need to share it.
No other changes.
Make sure every column in the schema, be it a column of partition
key, clustering key, static or regular one, has a unique ordinal
identifier.
This makes it easy to compute the set of columns used in a query,
as well as index row cells.
Allow to get column definition in schema by ordinal id.
"
Incremental compaction code to release exhausted sstables was inefficient because
it was basically preventing any release from ever happening. So a new solution is
implemented to make incremental compaction approach actually efficient while
being cautious about not introducing data resurrection. This solution consists of
storing GC'able tombstones in a temporary sstable and keeping it till the end of
compaction. Overhead is avoided by not enabling it to strategies that don't work
with runs composed of multiple fragments.
Fixes#4531.
tests: unit, longevity 1TB for incremental compaction
"
* 'fix_incremental_compaction_efficiency/v6' of https://github.com/raphaelsc/scylla:
tests: Check that partition is not resurrected on compaction failure
tests: Add sstable compaction test for gc-only mutation compactor consumer
sstables: Fix Incremental Compaction Efficiency
Introduce `scylla generate_object_graph`, a command which generates a
visual object graph, where vertices are objects and edges are
references. The graph starts from the object specified by the user. The
graph allows visual inspection of the object graph and hopefully allows
the user to identify the object in question.
Add the `--resolve` flag to `scylla find`. When specified, `scylla find`
will attempt to resolve the first pointer in the found objects as a vtable
pointer. If successful the pointer as well as the resolved symbol will
be added to the listing.
In the listing of `scylla fiber` also print the starting task (as the
first item).
This mini-series contains assorted improvements that I found very useful
while debugging OOM crashes in the past weeks:
* A wrapper for `std::list`.
* A wrapper for `std::variant`.
* Making `scylla find` usable from python code.
* Improvements to `scylla sstables` and `scylla task_histogram`
commands.
* The `$downcast_vptr()` convenience function.
* The `$dereference_lw_shared_ptr()` convenience function.
Convenience functions in gdb are similar to commands, with some key
differences:
* They have a defined argument list.
* They can return values.
* They can be part of any gdb expression in which functions are allowed.
This makes them very useful for doing operations on values then
returning them so that the developer can use it the gdb shell.
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by using lazy_row instead of row. Some call
sites treewide were adjusted to deal with the extra indirection.
perf_simple_query appears to improve by 2%, from 163krps to 165 krps,
though it's hard to be sure due to noisy measurements.
memory_footprint comparisons (before/after):
mutation footprint: mutation footprint:
- in cache: 1096 - in cache: 992
- in memtable: 854 - in memtable: 750
- in sstable: 351 - in sstable: 351
- frozen: 540 - frozen: 540
- canonical: 827 - canonical: 827
- query result: 342 - query result: 342
sizeof(cache_entry) = 112 sizeof(cache_entry) = 112
-- sizeof(decorated_key) = 36 -- sizeof(decorated_key) = 36
-- sizeof(cache_link_type) = 32 -- sizeof(cache_link_type) = 32
-- sizeof(mutation_partition) = 200 -- sizeof(mutation_partition) = 96
-- -- sizeof(_static_row) = 112 -- -- sizeof(_static_row) = 8
-- -- sizeof(_rows) = 24 -- -- sizeof(_rows) = 24
-- -- sizeof(_row_tombstones) = 40 -- -- sizeof(_row_tombstones) = 40
sizeof(rows_entry) = 232 sizeof(rows_entry) = 232
sizeof(lru_link_type) = 16 sizeof(lru_link_type) = 16
sizeof(deletable_row) = 168 sizeof(deletable_row) = 168
sizeof(row) = 112 sizeof(row) = 112
sizeof(atomic_cell_or_collection) = 8 sizeof(atomic_cell_or_collection) = 8
Tests: unit (dev)
lazy_row adds indirection to the row class, in order to reduce storage requirements
when the row is not present. The intent is to use it for the static row, which is
not present in many schemas, and is often not present in writes even in schemas that
have a static row.
Indirection is done using managed_ref, which is lsa-compatible.
lazy_row implements most of row's methods, and a few more:
- get(), get_existing(), and maybe_create(): bypass the abstraction and the
underlying row
- some methods that accept a row parameter also have an overload with a lazy_row
parameter
"Delete README-DPDK.md, move IDL.md to docs/ and fix
docs/review-checklist.md to point to scylla's coding style document,
instead of seastar's."
* 'documentation-cleanup/v3' of https://github.com/denesb/scylla:
docs/review-checklist.md: point to scylla's coding-style.md instead of seastar's
docs: mv coding-style.md docs/
rm README-DPDK.md
docs: mv IDL.md docs/
"Delete README-DPDK.md, move IDL.md to docs/ and fix
docs/review-checklist.md to point to scylla's coding style document,
instead of seastar's."
* 'documentation-cleanup/v3' of https://github.com/denesb/scylla:
docs/review-checklist.md: point to scylla's coding-style.md instead of seastar's
docs: mv coding-style.md docs/
rm README-DPDK.md
docs: mv IDL.md docs/
Allow returning fewer random clustering keys than requested since
the schema may limit the total number we can generate, for example,
if there is only one boolean clustering column.
Fixes#5161
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Calculating the select statement for given view_info structure
used to work fine, but once local indexes were introduced, a subtle
bug appeared: the legacy token column does not exist in local indexes
and a valid clustering key column was omitted instead.
That results in potentially incorrect partition slices being used later
in read-before-write.
There's a long term plan for removing select_statement from
view info altogether, but nonetheless the bug needs to be fixed first.
When investigating OOM:s a prominent pattern is a size class that is
exploded, using up most of the available memory alone. If one is lucky,
the objects causing the OOM are instances of some virtual class, making
their identification easy. Other times the objects are referenced by
instances of some virtual class, allowing their identification with some
work. However there are cases where neither these objects nor their
direct referrers are instances of virtual classes. This is the case
`scylla generate_object_graph` intends to help.
scylla generate_object_graph, like its name suggests generates the
object graph of the requested object. The object graph is a directed
graph, where vertices are objects and edges are references between them,
going from referrers to the referee. The vertices contain information,
like the address of the object, its size, whether it is a live or not
and if applies, the address and symbol name of its vtable. The edges
contain the list of offsets the referrer has references at. The
generated graph is an image, which allows the visual inspection of the
object graph, allowing the developer to notice patterns and hopefully
identify the problematic objects.
The graph is generated with the help of `graphwiz`. The command
generates `.dot` files which can be converted to images with the help of
the `dot` utility. The command can do this if the output file is one of
the supported image formats (e.g. `png`), otherwise only the `.dot` file
is generated, leaving the actual image generation to the user.
Add `--resolve` flag, which will make the command attempt to resolve the
first pointer of the found objects as a vtable pointer. If this is
successful the vtable pointer as well as the symbol name will be added
to the listing. This in particular makes backtracing continuation chains
a breeze, as the continuation object the searched one depends on can be
found at glance in the resulting listing (instead of having to manually
probe each item).
The arguments of `scylla find` are now parsed via `argparse`. While at
it, support for all the size classes supported by the underlying `find`
command were added, in addition to `w` and `g`. However the syntax of
specifying the size class to use has been changed, it now has to be
specified with the `-s|--size` command line argument, instead of passing
`-w` or `-g`.
Or in other words, the task that is the argument of the search. Example:
(gdb) scylla fiber 0x60001a305910
Starting task: (task*) 0x000060001a305910 0x0000000004aa5260 vtable for seastar::continuation<...> + 16
#0 (task*) 0x0000600016217c80 0x0000000004aa5288 vtable for seastar::continuation<...> + 16
#1 (task*) 0x000060000ac42940 0x0000000004aa2aa0 vtable for seastar::continuation<...> + 16
#2 (task*) 0x0000600023f59a50 0x0000000004ac1b30 vtable for seastar::continuation<...> + 16
This code is currently duplicated in `find_vptrs()` and
`scylla_task_histogram`. Refactor it out into a function.
The code is also improved in two ways:
* Make the search stricter, ensuring (hopefully) that indeed the
executable's text section is found, not that of the first object in
the `gdb file` listing.
* Throw an exception in the case when the search fails.
We don't want to add shared sstables to table's backlog tracker because:
1) table's backlog tracker has only an influence on regular compaction
2) shared sstables are never regular compacted, they're worked by
resharding which has its own backlog tracker.
Such sstables belong to more than one shard, meaning that currently
they're added to backlog tracker of all shards that own them.
But the thing is that such sstables ends up being resharded in shard
that may be completely random. So increasing backlog of all shards
such sstables belong to, won't lead to faster resharding. Also, table's
backlog tracker is supposed to deal only with regular compaction.
Accounting for shared sstables in table's tracker may lead to incorrect
speed up of regular compactions because the controller is not aware
that some relevant part of the backlog is due to pending resharding.
The fix is about ignoring sstables that will be resharded and let
table's backlog tracker account only for sstables that can be worked on
by regular compaction, and rely on resharding controlling itself
with its own tracker.
NOTE: this doesn't fix the resharding controlling issue completely,
as described in #4952. We'll still need to throttle regular compaction
on behalf of resharding. So subsequent work may be about:
- move resharding to its own priority class, perhaps streaming.
- make a resharding's backlog tracker accounts for sstables in all of
its pending jobs, not only the ongoing ones (currently limited to 1 by shard).
- limit compaction shares when resharding is in progress.
THIS only fixes the issue in which controller for regular compaction
shouldn't account sstables completely exclusive to resharding.
Fixes#5077.
Refs #4952.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190924022109.17400-1-raphaelsc@scylladb.com>
Incremental compaction efficiency depends on the reference of sstables
compacted being all released because the file descriptors of sstable
components are only closed once the sstable object is destructed.
Incremental compaction is not working for major compaction because a reference
to released sstables are being kept in the compaction manager, which prevents
their disk usage from being released. So the space amplification would be
the same as with a non-incremental approach, i.e. needs twice the amount of
used disk space for the table(s). With this issue fixed, the database now
becomes very major compaction friendly, the space requirement becoming very
low, a constant which is roughly number of fragments being currently compacted
multiplied by fragment size (1GB by default), for each table involved.
Fixes#5140.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191003211927.24153-1-raphaelsc@scylladb.com>
Make sure gc'able-tombstone-only sstable is properly generated with data that
comes from regular compaction's input sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction prevents data resurrection from happening by checking that there's
no way a data shadowed by a GC'able tombstone will survive alone, after
a failure for example.
Consider the following scenario:
We have two runs A and B, each divided to 5 fragments, A1..A5, B1..B5.
They have the following token ranges:
A: A1=[0, 3] A2=[4, 7] A3=[8, 11] A4=[12, 15] A5=[16,18]
B is the same as A's ranges, offset by 1:
B: B1=[1,4] B2=[5,8] B3=[9,12] B4=[13,16] B5=[17,19]
Let's say we are finished flushing output until position 10 in the compaction.
We are currently working on A3 and B3, so obviously those cannot be deleted.
Because B2 overlaps with A3, we cannot delete B2 either.
Otherwise, B2 could have a GC'able tombstone that shadows data in A3, and after
B2 is gone, dead data in A3 could be resurrected *on failure*.
Now, A2 overlaps with B2 which we couldn't delete yet, so we can't delete A2.
Now A2 overlaps with B1 so we can't delete B1. And B1 overlaps with A1 so
we can't delete A1. So we can't delete any fragment.
The problem with this approach is obvious, fragments can potentially not be
released due to data dependency, so incremental compaction efficiency is
severely reduced.
To fix it, let's not purge GC'able tombstones right away in the mutation
compactor step. Instead, let's have compaction writing them to a separate
sstable run that would be deleted in the end of compaction.
By making sure that tombstone information from all compacting sstables is not
lost, we no longer need to have incremental compaction imposing lots of
restriction on which fragments could be released. Now, any sstable which data
is safe in a new sstable can be released right away. In addition, incremental
compaction will only take place if compaction procedure is working with one
multi-fragment sstable run at least.
Fixes#4531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* seastar 1f68be436f...e888b1df9c (8):
> sharded: Make map work with mapper that returns a future
> cmake: Remove FindBoost.cmake
> Reduce noncopyable_function instruction cache footprint
> doc: add Loops section to the tutorial
> Merge "Move file related code out of reactor" from Asias
> Merge "Move the io_queue code out of reactor" from Asias
> cmake: expose seastar_perf_testing lib
> future: class doc: explain why discarding a future is bad
- main.cc now includes new file io_queue.hh
- perf tests now include seastar perf utilities via user, not
system, includes since those are not exported
Merged patch set from Piotr Sarna:
Refs #5046
This commit adds handling "Authorization:" header in incoming requests.
The signature sent in the authorization is recomputed server-side
and compared with what the client sent. In case of a mismatch,
UnrecognizedClientException is returned.
The signature computation is based on boto3 Python implementation
and uses gnutls to compute HMAC hashes.
This series is rebased on a previous HTTPS series in order to ease
merging these two. As such, it depends on the HTTPS series being
merged first.
Tests: alternator(local, remote)
The series also comes with a simple authorization test and a docs update.
Piotr Sarna (6):
alternator: migrate split() function to string_view
alternator: add computing the auth signature
config: add alternator_enforce_authorization entry
alternator: add verifying the auth signature
alternator-test: add a basic authorization test case
docs: update alternator authorization entry
alternator-test/test_authorization.py | 34 ++++++++
configure.py | 1 +
alternator/{server.hh => auth.hh} | 22 ++---
alternator/server.hh | 3 +-
db/config.hh | 1 +
alternator/auth.cc | 88 ++++++++++++++++++++
alternator/server.cc | 112 +++++++++++++++++++++++---
db/config.cc | 1 +
main.cc | 2 +-
docs/alternator/alternator.md | 7 +-
10 files changed, 241 insertions(+), 30 deletions(-)
create mode 100644 alternator-test/test_authorization.py
copy alternator/{server.hh => auth.hh} (58%)
create mode 100644 alternator/auth.cc
Before this change, when populating non-system keyspaces, each data
directory was scanned and for each entry (keyspace directory),
a keyspace was populated. This was done in a serial fashion - populating
of one keyspace was not started until the previous one was done.
Loading keyspaces in such fashion can introduce unnecessary waiting
in case of a large number of keyspaces in one data directory. Population
process is I/O intensive and barely uses CPU.
This change enables parallel loading of keyspaces per data directory.
Populating the next keyspace does not wait for the previous one.
A benchmark was performed measuring startup time, with the following
setup:
- 1 data directory,
- 200 keyspaces,
- 2 tables in each keyspace, with the following schema:
CREATE TABLE tbl (a int, b int, c int, PRIMARY KEY(a, b))
WITH CLUSTERING ORDER BY (b DESC),
- 1024 rows in each table, with values (i, 2*i, 3*i) for i in 0..1023,
- ran on 6-core virtual machine running on i7-8750H CPU,
- compiled in dev mode,
- parameters: --smp 6 --max-io-requests 4 --developer-mode=yes
--datadir $DIR --commitlog-directory $DIR
--hints-directory $DIR --view-hints-directory $DIR
The benchmark tested:
- boot time, by comparing timestamp of the first message in log,
and timestamp of the following message:
"init - Scylla version ... initialization completed."
- keyspace population time, by comparing timestamps of messages:
"init - loading non-system sstables"
and
"init - starting view update generator"
The benchmark was run 5 times for sequential and parallel version,
with the following results:
- sequential: boot 31.620s, keyspace population 6.051s
- parallel: boot 29.966s, keyspace population 4.360s
Keyspace population time decreased by ~27.95%, and overall boot time
by about ~5.23%.
Tests: unit(release)
Fixes#2007
The signature sent in the "Authorization:" header is now verified
by computing the signature server-side with a matching secret key
and confirming that the signatures match.
Currently the secret key is hardcoded to be "whatever" in order
to work with current tests, but it should be replaced
by a proper key store.
Refs #5046
The config entry will be used to turn authorization for alternator
requests on and off. The default is currently off, since the key store
is not implemented yet.
A function for computing the auth signature from user requests
is added, along with helper functions. The implementation
is based on gnutls's HMAC.
Refs #5046
The implementation of string split was based on sstring type for
simplicity, but it turns out that more generic std::string_view
will be beneficial later to avoid unneeded string copying.
Unfortunately boost::split does not cooperate well with string views,
so a simple manual implementation is provided instead.
Schema changes can have big effects on performance, typically it should
be a rare event.
It is usefull to monitor how frequently the schema changed.
This patch adds a counter that increases each time a schema changed.
After this patch the metrics would look like:
scylla_database_schema_changed{shard="0",type="derive"} 2
Fixes#4785
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We can use the reader::peek() to check if the reader contains any data.
If not, do not open the rpc stream connection. It helps to reduce the
port usage.
Refs: #4943
Both in a single-statement transaction and in a batch
we expect that serial consistency is provided. Move the
check to query_options class and make it available for
reuse.
Keep get_serial_consistency() around for use in
transport/server.cc.
Message-Id: <20191006154532.54856-2-kostja@scylladb.com>
When another node is reported to be down, view updates queued
for it are cancelled, but some of them may already be initiated.
Right now, cancelling such a write resulted in an exception,
but on conceptual level it's not really an exception, since
this behaviour is expected.
Previous version of this patch was based on introducing a special
exception type that was later handled specially, but it's not clear
if it's a good direction. Instead, this patch simply makes this
path non-exceptional, as was originally done by Nadav in the first
version of the series that introduced handling unstarted write
cancellations. Additionally, a message containing the information
that a write is cancelled is logged with debug level.
README.md has 3 fixes applied:
- s/alternator_tls_port/alternator_https_port
- conf directory is mentioned more explicitly
- it now correctly states that the self-signed certificate
warning *is* explicitly ignored in tests
Message-Id: <e5767f7dbea260852fc2fa9b613e1bebf490cc78.1570444085.git.sarna@scylladb.com>
"
Fixes#5134, Eviction concurrent with preempted partition entry update after
memtable flush may allow stale data to be populated into cache.
Fixes#5135, Cache reads may miss some writes if schema alter followed by a
read happened concurrently with preempted partition entry update.
Fixes#5127, Cache populating read concurrent with schema alter may use the
wrong schema version to interpret sstable data.
Fixes#5128, Reads of multi-row partitions concurrent with memtable flush may
fail or cause a node crash after schema alter.
"
* tag 'fix-cache-issues-with-schema-alter-and-eviction-v2' of github.com:tgrabiec/scylla:
tests: row_cache: Introduce test_alter_then_preempted_update_then_memtable_read
tests: row_cache_stress_test: Verify all entries are evictable at the end
tests: row_cache_stress_test: Exercise single-partition reads
tests: row_cache_stress_test: Add periodic schema alters
tests: memtable_snapshot_source: Allow changing the schema
tests: simple_schema: Prepare for schema altering
row_cache: Record upgraded schema in memtable entries during update
memtable: Extract memtable_entry::upgrade_schema()
row_cache, mvcc: Prevent locked snapshots from being evicted
row_cache: Make evict() not use invalidate_unwrapped()
mvcc: Introduce partition_snapshot::touch()
row_cache, mvcc: Do not upgrade schema of entries which are being updated
row_cache: Use the correct schema version to populate the partition entry
delegating_reader: Optimize fill_buffer()
row_cache, memtable: Use upgrade_schema()
flat_mutation_reader: Introduce upgrade_schema()
Merged patch series from Piotr Sarna:
This series adds HTTPS support for Alternator.
The series comes with --https option added to alternator-test, which makes
the test harness run all the tests with HTTPS instead of HTTP. All the tests
pass, albeit with security warnings that a self-signed x509 certificate was
used and it should not be trusted.
Fixes#5042
Refs scylladb/seastar#685
Patches:
docs: update alternator entry on HTTPS
alternator-test: suppress the "Unverified HTTPS request" warning
alternator-test: add HTTPS info to README.md
alternator-test: add HTTPS to test_describe_endpoints
alternator-test: add --https parameter
alternator: add HTTPS support
config: add alternator HTTPS port
* seastar c21a7557f9...1f68be436f (6):
> scheduling: Add per scheduling group data support
> build: Include dpdk as a single object in libseastar.a
> sharded: fix foreign_ptr's move assignment
> build: Fix DPDK libraries linking in pkg-config file
> http server: https using tls support
> Make output_stream blurb Doxygen
The BEGINS_WITH condition in conditional updates (via Expected) requires
that the given operand be either a string or a binary. Any other operand
should result in a validation exception - not a failed condition as we
generate now.
This patch fixes the test for this case so it will succeed against
Amazon DynamoDB (before this patch it fails - this failure was masked by
a typo before commit 332ffa77ea). The patch
then fixes our code to handle this case correctly.
Note that BEGINS_WITH handling of wrong types is now asymmetrical: A bad
type in the operand is now handled differently from a bad type in the
attribute's value. We add another check to the test to verify that this
is the case.
Fixes#5141
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191006080553.4135-1-nyh@scylladb.com>
When debugging one constantly has to inspect object for which only
a "virtual pointer" is available, that is a pointer that points to a
common parent class or interface.
Finding the concrete type and downcasting the pointer is easy enough but
why do it manually when it is possible to automate it trivially?
$downcast_vptr() returns any virtual pointer given to it, casted to the
actual concrete object.
Exlample:
(gdb) p $1
$2 = (flat_mutation_reader::impl *) 0x60b03363b900
(gdb) p $downcast_vptr(0x60b03363b900)
$3 = (combined_mutation_reader *) 0x60b03363b900
# The return value can also be dereferenced on the spot.
(gdb) p *$downcast_vptr($1)
$4 = {<flat_mutation_reader::impl> = {_vptr.impl = 0x46a3ea8 <vtable
for combined_mutation_reader+16>, _buffer = {_impl = {<std::al...
Dereferencing an `seastar::lw_shared_ptr` is another tedious manual
task. The stored pointer (`_p`) has to be casted to the right subclass
of `lw_shared_ptr_counter_base`, which involves inspecting the code,
then make writing a cast expression that gdb is willing to parse. This
is something machines are so much better at doing.
`$dereference_lw_shared_ptr` returns a pointer to the actual pointed-to
object, given an instance of `seastar::lw_shared_ptr`.
Example:
(gdb) p $1._read_context
$2 = {_p = 0x60b00b068600}
(gdb) p $dereference_lw_shared_ptr($1._read_context)
$3 = {<seastar::enable_lw_shared_from_this<cache::read_context>>
= {<seastar::lw_shared_ptr_counter_base> = {_count = 1}, ...
Make all the parameters of the sampling tweakable via command line
arguments. I strived to keep full backward compatibility, but due to the
limitations of `argparse` there is one "breaking" change. The optional
positional size argument is now a non-positional argument as `argparse`
doesn't support optional positional arguments.
Added documentation for both the command itself as well as for all the
arguments.
make_single_key_reader() currently doesn't actually create
single-partition readers because it doesn't set
mutation_reader::forwarding::no when it creates individual
readers. The readers will default to mutation_reader::forwarding::yes
and actually create scanning readers in preparation for
fast-forwarding across partitions.
Fix by passing mutation_reader::forwarding::no.
Currently, methods of simple_schema assume that table's schema doesn't
change. Accessors like get_value() assume that rows were generated
using simple_schema::_s. Because if that, the column_definition& for
the "v" column is cached in the instance. That column_definiion&
cannot be used to access objects created with a different schema
version. To allow using simple_schema after schema changes,
column_definition& caching is now tagged with the table schema version
of origin. Methods which access schema-dependent objects, like
get_value(), are now accepting schema& corresponding to the objects.
Also, it's now possible to tell simple_schema to use a different
schema version in its generator methods.
Cache update may defer in the middle of moving of partition entry
from a flushed memtable to the cache. If the schema was changed since
the entry was written, it upgrades the schema of the partition_entry
first but doesn't update the schema_ptr in memtable_entry. The entry
is removed from the memtable afterward. If a memtable reader
encounters such an entry, it will try to upgrade it assuming it's
still at the old schema.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are interpreted
as variable-sized cells, and we misinterpret a value for a huge
size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
It's the more likely to happen the more rows there are in a
partition. It's unlikely to happen with single-row partitions.
Introduced in 70c7277.
Fixes#5128.
If the whole partition entry is evicted while being updated from the
memtable, a subsequent read may populate the partition using the old
version of data if it attempts to do it before cache update advances
past that partition. Partial eviction is not affected because
populating reads will notice that there is a newer snapshot
corresponding to the updater.
This can happen only in OOM situations where the whole cache gets evicted.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
Introduced in 70c7277.
Fixes#5134.
invalidate_unwrapped() calls cache_entry::evict(), which cannot be
called concurrently with cache update. invalidate() serializes it
properly by calling do_update(), but evict() doesn't. The purpose of
evict() is to stress eviction in tests, which can happen concurrently
with cache update. Switch it to use memory reclaimer, so that it's
both correct and more realistic.
evict() is used only in tests.
When a read enters a partition entry in the cache, it first upgrades
it to the current schema of the cache. The same happens when an entry
is updated after a memtable flush. Upgrading the entry is currently
performed by squashing all versions and replacing them with a single
upgraded version. That has a side effect of detaching all snapshots
from the partition entry. Partition entry update on memtable flush is
writing into a snapshot. If that snapshot is detached by a schema
upgrade, the entry will be missing writes from the memtable which fall
into continuous ranges in that entry which have not yet been updated.
This can happen only if the update of the entry is preempted and the
schema was altered during that, and a read hit that partition before
the update went past it.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
The problem is fixed by locking updated entries and not upgrading
schema of locked entries. cache_entry::read() is prepared for this,
and will upgrade on-the-fly to the cache's schema.
Fixes#5135
The sstable reader which populates the partition entry in the cache is
using the schema of the partition entry snapshot, which will be the
schema of the cache at the time the partition was entered. If there
was a schema change after the cache reader entered the partition but
before it created the sstable reader, the cache populating reader will
interpret sstable fragments using the wrong schema version. That is
more likely if partitions have many rows, and the front of the
partition is populated. With single-row partitions that's unlikely to
happen.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are
interpreted as variable-sized cells, and we misinterpret
a value for a huge size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
Fixes#5127.
Use move_buffer_content_to() which is faster than fill_buffer_from()
because it doesn't involve popping and pushing the fragments across
buffers. We save on size estimation costs.
Running with --https and a self-signed certificate results in a flood
of expected warnings, that the connection is not to be trusted.
These warnings are silenced, as users runing a local test with --https
usually use self-signed certificates.
The test_describe_endpoints test spawns another client connection
to the cluster, so it needs to be HTTPS-aware in order to work properly
with --https parameter.
Running with --https parameter will result in sending the requests
via HTTPS instead of HTTP. By default, port 8043 is used for a local
cluster. Before running pytest --https, make sure that Scylla
was properly configured to initialize a HTTPS alternator server
by providing the alternator_tls_port parameter.
The HTTPS-based connection runs with verification disabled,
otherwise it would not work with self-signed certificates,
which are useful for tests.
By providing a server based on a TLS socket, it's now possible
to serve HTTPS requests in alternator. The HTTPS server is enabled
by setting its port in scylla.yaml: alternator_tls_port=XXXX.
Alternator TLS relies on the existing TLS configuration,
which is provided by certificate, keyfile, truststore, priority_string
options.
Fixes#5042
The test test_update_expression_function_nesting() fails because DynamoDB
don't allow an expression like list_append(list_append(:val1, :val2), :val3)
but Alternator doesn't check for this (and supports this expression).
The "xfail" message was outdated, suggesting that the test fails because
the "SET" expression isn't supported - but it is. So replace the message
by a more accurate one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190915104708.30471-1-nyh@scylladb.com>
Merged patch set from Dejan Mircevski implementing some of the
missing operators for Expected: NE, IN, NULL and NOT_NULL.
Patches:
alternator: Factor out Expected operand checks
alternator: Implement NOT_NULL operator in Expected
alternator: Implement NULL operator in Expected
alternator: Fix expected_1_null testcase
alternator: Implement IN operator in Expected
alternator: Implement NE operator in Expected
alternator: Factor out common code in Expected
Frozen empty lists/map/sets are not equal to null value,
whil multi-cell empty lists/map/sets are equal to null values.
Return a NULL value for an empty multi-cell set or list
if we know the receiver is not frozen - this makes it
easy to compare the parameter with the receiver.
Add a test case for inserting an empty list or set
- the result is indistinguishable from NULL value.
Message-Id: <20191003092157.92294-2-kostja@scylladb.com>
"
Fix races that may lead to use-after-free events and file system level exceptions
during shutdown and drain.
The root cause of use-after-free events in question is that space_watchdog blocks on
end_point_hints_manager::file_update_mutex() and we need to make sure this mutex is alive as long as
it's accessed even if the corresponding end_point_hints_manager instance
is destroyed in the context of manager::drain_for().
File system exceptions may occur when space_watchdog attempts to scan a
directory while it's being deleted from the drain_for() context.
In case of such an exception new hints generation is going to be blocked
- including for materialized views, till the next space_watchdog round (in 1s).
Issues that are fixed are #4685 and #4836.
Tested as follows:
1) Patched the code in order to trigger the race with (a lot) higher
probability and running slightly modified hinted handoff replace
dtest with a debug binary for 100 times. Side effect of this
testing was discovering of #4836.
2) Using the same patch as above tested that there are no crashes and
nodes survive stop/start sequences (they were not without this series)
in the context of all hinted handoff dtests. Ran the whole set of
tests with dev binary for 10 times.
"
* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
hinted handoff: make taking file_update_mutex safe
db::hints::manager::drain_for(): fix alignment
db::hints::manager: serialize calls to drain_for()
db::hints: cosmetics: identation and missing method qualifier
The operation after gate.enter() in tracker::start() can fail and throw,
we should call gate.leave() in such case to avoid unbalanced enter and
leave calls. tracker::done() has similar issue too.
Fix it by removing the gate enter and leave logic in tracker start and
done. A helper tracker::run() is introduced to take care of the gate and
repair status.
In addition, the error log is improved. It now logs exceptions on all
shards in the summary. e.g.,
[shard 0] repair - repair id 1 failed: std::runtime_error
({shard 0: std::runtime_error (error0), shard 1: std::runtime_error (error1)})
Fixes#5074
Currently, the population stat is not increased for entries that are
evicted immediately on insert, however the code that does the eviction
still decreases the population stat, leading to an imbalance and in some
cases the underflow of the population stat. To fix, unconditionally
increase the population stat upon inserting an entry, regardless of
whether it is immediately evicted or not.
Fixes: #5123
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>
Put all AttributeValuelist size verification under
verify_operand_count(), rather than have some cases invoke
verify_operand_count() while others verify it in check_*() functions.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add check_IN() and a switch case that invokes it. Reactivate IN
tests. Add a testcase for non-scalar attribute values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Recognize "NE" as a new operator type, add check_NE() function, invoke
it in verify_expected_one(), and reactivate NE tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Operand-count verification will be repeated a lot as more operators
are implemented, so factor it out into verify_operand_count().
Also move `got` null checks to check_* functions, which reduces
duplication at call sites.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
While a managed_ref emulates a reference more closely than it does
a pointer, it is still nullable, so add a get() (similar to
unique_ptr::get()) that can be nullptr if the reference is null.
The immediate use will be mutation_partition::_static_row, which
is often empty and takes up about 10% of a cache entry.
The example Python code had wrong indentation, and wouldn't actually
work if naively copy-pasted. Noticed by Noam Hasson.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190929091440.28042-1-nyh@scylladb.com>
"
This is a collection of assorted patches that will be needed for LWT.
Most of them are trivial, but one touches a lot of files, so have a
good chance to cause rebase headache (I already had to rebase it on
top of Alternator). Lets push them earlier instead of carrying them in
the lwt branch.
"
* 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev:
lwt: make _last_timestamp_micros static
lwt: Add client_state::get_timestamp_for_paxos() function
lwt: Pass client_state reference all the way to storage_proxy::query
exceptions: Add a constructor for unavailable_exception that allows providing a custom message
serializer: Add std::variant support
lwt: Add missing functions to utils/UUID_gen.hh
"
This is the second version of the patch series. The previous one was just the second patch, this one adds more tests an another patch to make it easier to test that the new code has the same behavior as the old one.
"
* 'espindola/overflow-is-intentional' of https://github.com/espindola/scylla:
types: Simplify and explain from_varint_to_integer
Add more cast tests
Affects single-partition reads only.
Refs #5113
When executing a query on the replica we do several things in order to
narrow down the sstable set we read from.
For tables which use LeveledCompactionStrategy, we store sstables in
an interval set and we select only sstables whose partition ranges
overlap with the queried range. Other compaction strategies don't
organize the sstables and will select all sstables at this stage. The
reasoning behind this is that for non-LCS compaction strategies the
sstables' ranges will typically overlap and using interval sets in
this case would not be effective and would result in quadratic (in
sstable count) memory consumption.
The assumption for overlap does not hold if the sstables come from
repair or streaming, which generates non-overlapping sstables.
At a later stage, for single-partition queries, we use the sstables'
bloom filter (kept in memory) to drop sstables which surely don't
contain given partition. Then we proceed to sstable indexes to narrow
down the data file range.
Tables which don't use LCS will do unnecessary I/O to read index pages
for single-partition reads if the partition is outside of the
sstable's range and the bloom filter is ineffective (Refs #5112).
This patch fixes the problem by consulting sstable's partition range
in addition to the bloom filter, so that the non-overlapping sstables
will be filtered out with certainty and not depend on bloom filter's
efficiency.
It's also faster to drop sstables based on the keys than the bloom
filter.
Tests:
- unit (dev)
- manual using cqlsh
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>
The method sstable::estimated_keys_for_range() was severely
under-estimating the number of partitions in an sstable for a given
token range.
The first reason is that it underestimated the number of sstable index
pages covered by the range, by one. In extreme, if the requested range
falls into a single index page, we will assume 0 pages, and report 1
partition. The reason is that we were using
get_sample_indexes_for_range(), which returns entries with the keys
falling into the range, not entries for pages which may contain the
keys.
A single page can have a lot of partitions though. By default, there
is a 1:20000 ratio between summary entry size and the data file size
covered by it. If partitions are small, that can be many hundreds of
partitions.
Another reason is that we underestimate the number of partitions in an
index page. We multiply the number of pages by:
(downsampling::BASE_SAMPLING_LEVEL * _components->summary.header.min_index_interval)
/ _components->summary.header.sampling_level
Using defaults, that means multiplying by 128. In the cassandra-stress
workload a single partition takes about 300 bytes in the data file and
summary entry is 22 bytes. That means a single page covers 22 * 20'000
= 440'000 bytes of the data file, which contains about 1'466
partitions. So we underestimate by an order of magnitude.
Underestimating the number of partitions will result in too small
bloom filters being generated for the sstables which are the output of
repair or streaming. This will make the bloom filters ineffective
which results in reads selecting more sstables than necessary.
The fix is to base the estimation on the number of index pages which
may contain keys for the range, and multiply that by the average key
count per index page.
Fixes#5112.
Refs #4994.
The output of test_key_count_estimation:
Before:
count = 10000
est = 10112
est([-inf; +inf]) = 512
est([0; 0]) = 128
est([0; 63]) = 128
est([0; 255]) = 128
est([0; 511]) = 128
est([0; 1023]) = 128
est([0; 4095]) = 256
est([0; 9999]) = 512
est([5000; 5000]) = 1
est([5000; 5063]) = 1
est([5000; 5255]) = 1
est([5000; 5511]) = 1
est([5000; 6023]) = 128
est([5000; 9095]) = 256
est([5000; 9999]) = 256
est(non-overlapping to the left) = 1
est(non-overlapping to the right) = 1
After:
count = 10000
est = 10112
est([-inf; +inf]) = 10112
est([0; 0]) = 2528
est([0; 63]) = 2528
est([0; 255]) = 2528
est([0; 511]) = 2528
est([0; 1023]) = 2528
est([0; 4095]) = 5056
est([0; 9999]) = 10112
est([5000; 5000]) = 2528
est([5000; 5063]) = 2528
est([5000; 5255]) = 2528
est([5000; 5511]) = 2528
est([5000; 6023]) = 5056
est([5000; 9095]) = 7584
est([5000; 9999]) = 7584
est(non-overlapping to the left) = 0
est(non-overlapping to the right) = 0
Tests:
- unit (dev)
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927141339.31315-1-tgrabiec@scylladb.com>
`dbuild` was recently (24c732057) updated to run in interactive mode
when given no arguments; we can now update the README to mention that.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the toppartitions operation gathers results, it copies partition
keys with their schema_ptr:s. When these schema_ptr:s are copies
or destroyed, they can cause leaks or premature frees of the schema
in its original shard since reference count operations in are not atomic.
Fix that by converting the schema_ptr to a global_schema_ptr during
transportation.
Fixes#5104 (direct bug)
Fixes#5018 (schema prematurely freed, toppartitions previously executed on that node)
Fixes#4973 (corrupted memory pool of the same size class as schema, toppartitions previously executed on that node)
Tests: new test added that fails with the existing code in debug mode,
manual toppartitions test
Copying schema_ptrs across shards results in memory corruption since
lw_shared_ptr does not use atomic operations for reference counts.
Prevent that by converting schema_ptr:s to global_schema_ptr:s before
shipping them across shards in the map operation, and converting them
back to local schema_ptr:s in the reduce operation.
This allows keys from different stages in the schema's like to compare equal.
This is safe since the partition key cannot change, unlike the rest of the schema.
More importantly, it will allow us to compare keys made local after a pass through
global_schema_ptr, which does not guarantee that the schema_ptr conversion will be
the same even when starting with the same global_schema_ptr.
Throwing move constructors are a a pain; so we should try to make
them noexcept. Currently, global_schema_ptr's move constructor
throws an exception if used illegaly (moving from a different shard);
this patch changes it to an assert, on the grounds that this error
is impossible to recover from.
The direct motivation for the patch is the desire to store objects
containing a global_schema_ptr in a chunked_vector, to move lists
of partition keys across shards for the topppartitions functionality.
chunked_vector currently requires noexcept move constructors for its
value_type.
When a user type changes we were not recreating other uses types that
use it. This patch series fixes that and makes it clear which code is
responsible for it.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.
At runtime a user type has direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when a
entry in system.types changes.
Fixes#5049
If each client_state has its own copy of the variable two clients may
generate timestamps that clash and needlessly create contention. Making
the variable shared between all client_state on the same shard will make
sure this will not happen to two clients on the same shard. It may still
happen for two client on two different shards or two different nodes.
Paxos needs a unique timestamp that is greater than some other
timestamp, so that the next round had more chances to succeed.
Add a function that returns such a timestamp.
client_state holds a state to generate monotonically increasing unique
timestamp. Queries with a SERIAL consistency level need it to generate
a paxos round.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.
At runtime a user type has direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when a
entry in system.types changes.
Fixes#5049
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The way schema changes propagate is by editing the system tables and
comparing the before and after state.
When a user type A uses another user type B and we modify B, the
representation of A in the system table doesn't change, so this code
was not producing any changes on the diff that the receiving side
uses.
Deleting it makes it clear that it is the receiver's responsibility to
handle dependent user types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch db::cql_type_parser::raw_builder creates a local copy
of the list of existing types and uses that internally. By doing that
build() should have no observable behavior other than returning the
new types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We were never passing a null pointer and never saving a copy of the
lw_shared_ptr. Passing a reference is more flexible as not all callers
are required to hold the user_types_metadata in a lw_shared_ptr.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
* seastar b56a8c5045...c21a7557f9 (3):
> net: socket::{set,get}_reuseaddr() should not be virtual
> iotune: print verbose message in case of shutdown errors
> iotune: close test file on shutdown
Fixes#4946.
1. Add assert in remove_response_handler to make crashes like in #5032 easier to understand.
2. Lookup the view_update_write_response_handler id before calling timeout_cb and tolerate it not found.
Just log a warning if this happened.
Fixes#5032
"
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during table schema altering when we receive the
mutation from a node which hasn't processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes#5095.
"
* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
mvcc: Fix incorrect schema verison being used to copy the mutation when applying
mutation_partition: Track and validate schema version in debug builds
tests: Use the correct schema to access mutation_partition
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect, because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during alter when we receive the
mutation from a node which hasn't processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes#5095.
This patch makes mutation_partition validate the invariant that it's
supposed to be accessed only with the schema version which it conforms
to.
Refs #5095
* seastar e51a1a8ed9...b56a8c5045 (3):
> net: add support for UNIX-domain sockets
> future: Warn on promise::set_exception with no corresponding future or task
> Merge "Handle exceptions in repeat_until_value and misc cleanups" from Rafael
Handle a race where a write handler is removed from _response_handlers
but not yet from _view_update_handlers_list.
Fixes#5032
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor remove_response_handler_entry out of remove_response_handler,
to be called on a valid iterator found by _response_handlers.find(id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Help identify cases like seen in #5032 where the handler id
wasn't found from the on_down -> timeout_cb path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
compaction_manager::perform_sstable_upgrade() fails when it feeds
compaction mechanism with shared sstables. Shared sstables should
be ignored when performing upgrade and so wait for reshard to pick
them up in parallel. Whenever a shared sstable is brought up either
on restart or via refresh, reshard procedure kicks in.
Reshard picks the highest supported format so the upgrade for
shared sstable will naturally take place.
Fixes#5056.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190925042414.4330-1-raphaelsc@scylladb.com>
- Update the outdated comments in do_stop_gossiping. It was
storage_service not storage_proxy that used the lock. More
importantly, storage_service does not use it any more.
- Drop the unused timer_callback_lock and timer_callback_unlock API
- Use with_semaphore to make sure the semaphore usage is balanced.
- Add log in gossiper::do_stop_gossiping when it tries to take the
semaphore to help debug hang during the shutdown.
Refs: #4891
Refs: #4971
A documentation file that is intended to be a place for anything
debugging related: getting started tutorial, tips and tricks and
advanced guides.
For now it contains a short introductions, some selected links to
more in-depth documentation and some trips and tricks that I could think
off the top of my head.
One of those tricks describes how to load cores obtained from
relocatable packages inside the `dbuild` container. I originally
intended to add that to `tools/toolchain/README.md` but was convinced
that `docs/debugging.md` would be a better place for this.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190924133110.15069-1-bdenes@scylladb.com>
Recently we have seen a case where the population stat of the cache was
corrupt, either due to misaccounting or some more serious corruption.
When debugging something like that it would have been useful to know how
many items have been inserted to the cache. I also believe that such a
counter could be useful generally as well.
Refs: #4918
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190924083429.43038-1-bdenes@scylladb.com>
"
We observed an abort on bad_alloc which was not caused by real OOM,
but could be explained by cache region being locked from a different
shard, which is not allowed, concurrently with memory reclamation.
It's impossible now to prove this, or, if that was indeed the case, to
determine which code path was attempting such lock. This patch adds an
assert which would catch such incorrect locking at the attempt.
Refs #4978
Tests:
- unit (dev, release, debug)
"
* 'assert-no-xshard-lsa-locking' of https://github.com/tgrabiec/scylla:
lsa: Assert no cross-shard region locking
tests: Make managed_vector_test a seastar test
* seastar 2a526bb120...e51a1a8ed9 (2):
> rpc: introduce rpc::tuple as a way to move away from variadic future
> shared_future: don't warn on broken futures
Make it easier for the IDE to resolve references to the seastar
namespace. In any case include files should be stand-alone and not
depend on previously included files.
The build directory is meaningless, since it is typically some
directory in a continuous integration server. That means someone
debugging the relocatable package needs to issue the gdb command
'set substitute-path' with the correct arguments, or they lose
source debugging. Doing so in the relocatable package build saves
this step.
The default build is not modified, since a typical local build
benefits from having the paths hardcoded, as the debugger will
find the sources automatically.
We observed an abort on bad_alloc which was not caused by real OOM,
but could be explained by cache region being locked from a different
shard, which is not allowed, concurrently with memory reclamation.
It's impossible now to prove this, or, if that was indeed the case, to
determine which code path was attempting such lock. This patch adds an
assert which would catch such incorrect locking at the attempt.
Refs #4978
LCS demotes a SSTable from a given level when it thinks that level is inactive.
Inactive level means N rounds (compaction attempt) without any activity in it,
in other words, no SSTable has been promoted to it.
The problem happens because the metadata that tracks inactiveness of each level
can be incorrectly updated when there's an ongoing compaction. LCS has parallel
compaction disabled. So if a table finds itself running a long operation like
cleanup that blocks minor compaction, LCS could incorrectly think that many
levels need demotion, and by the time cleanup finishes, some demotions would
incorrectly take place.
This problem is fixed by only updating the counter that tracks inactiveness
when compaction completes, so it's not incorrectly updated when there's an
ongoing compaction for the table.
Fixes#4919.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190917235708.8131-1-raphaelsc@scylladb.com>
A recent fix to #3767 limited the amount of ranges that
can return from query_ranges_to_vnodes_generator. This with
the combination of a large amount of token ranges can lead to
an infinite recursion. The algorithm multiplies by factor of
2 (actualy a shift left by one) the amount of requested
tokens in each recursion iteration. As long as the requested
number of ranges is greater than 0, the recursion is implicit,
and each call is scheduled separately since the call is inside
a continuation of a map reduce.
But if the amount of iterations is large enough (~32) the
counter for requested ranges zeros out and from that moment on
two things will happen:
1. The counter will remain 0 forever (0*2 == 0)
2. The map reduce future will be immediately available and this
will result in the continuation being invoked immediately.
The latter causes the recursive call to be a "regular" recursive call
thus, through the stack and not the task queue of the scheduler, and
the former causes this recursion to be infinite.
The combination creates a stack that keeps growing and eventually
overflows resulting in undefined behavior (due to memory overrun).
This patch prevent the problem from happening, it limits the growth of
the concurrency counter beyond twice the last amount of tokens returned
by the query_ranges_to_vnodes_generator.And also makes sure it is not
get stuck at zero.
Testing: * Unit test in dev mode.
* Modified add 50 dtest that reproduce the problem
Fixes#4944
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190922072838.14957-1-eliransin@scylladb.com>
Before this patch, if the _gate is closed, with_gate throws and
forward_to is not executed. When the promise<> p is destroyed it marks
its _task as a broken promise.
What happens next depends on the branch.
On master, we warn when the shared_future is destroyed, so this patch
changes the warning from a broken_promise to a gate closed.
On 3.1, we warn when the promises in shared_future::_peers are
destroyed since they no longer have a future attached: The future that
was attached was the "auto f" just before the with_gate call, and it
is destroyed when with_gate throws. The net result is that this patch
fixes the warning in 3.1.
I will send a patch to seastar to make the warning on master more
consistent with the warning in 3.1.
Fixes#4394
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190917211915.117252-1-espindola@scylladb.com>
Scylla currently crashes if we run manual operations like nodetool
compact with the controller disabled. While we neither like nor
recommend running with the controller disabled, due to some corner cases
in the controller algorithm we are not yet at the point in which we can
deprecate this and are sometimes forced to disable it.
The reason for the crash is that manual operations will invoke
_backlog_of_shares, which returns what is the backlog needed to
create a certain number of shares. That scan the existing control
points, but when we run without the controller there are no control
points and we crash.
Backlog doesn't matter if the controller is disabled, and the return
value of this function will be immaterial in this case. So to avoid the
crash, we return something right away if the controller is disabled.
Fixes#5016
Signed-off-by: Glauber Costa <glauber@scylladb.com>
gdb searches for libthread_db.so using its canonical name of libthread_db.so.1 rather
than the file name of libthread_db-1.0.so, so use that name to store the file in the
archive.
Fixes#4996.
* seastar b3fb4aaab3...84d8e9fe9b (8):
> Use aio fsync if available
> Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb
> lz4: use LZ4_decompress_safe
> reactor: document seastar::remove_file()
> core/file.hh: remove redundant std::move()
> core/{file,sstring}: do not add `const` to return value
> http/api_docs: always call parent constructor
> Add input_stream blurb
Currently, if updating bookkeeping operations for view building fails,
we log the error message and continue. However, during shutdown,
some errors are more likely to happen due to existing issues
like #4384. To differentiate actual errors from semi-expected
errors during shutdown, the latter are now logged with a warning
level instead of error.
Fixes#4954
Shutdown routines are usually implemented via the deferred_action
mechanism, which runs a function in its destructor. We thus expect
the function to be noexcept, but unfortunately it's not always
the case. Throwing in the destructor results in terminating the program
anyway, but before we do that, the exception can be logged so it's
easier to investigate and pinpoint the issue.
Example output before the patch:
INFO 2019-09-10 12:49:05,858 [shard 0] view - Stopping view builder
terminate called without an active exception
Aborting on shard 0.
Backtrace:
0x000000000184a9ad
(...)
Example output after the patch:
INFO 2019-09-10 12:49:05,858 [shard 0] view - Stopping view builder
ERROR 2019-09-10 12:49:05,858 [shard 0] init - Unexpected error on shutdown: std::runtime_error (Hello there!)
terminate called without an active exception
Aborting on shard 0.
Backtrace:
0x000000000184a9ad
(...)
This simplifies the implementation of from_varint_to_integer and
avoids using the fact that a static_cast from cpp_int to uint64_t
seems to just keep the low 64 bits.
The boost release notes
(https://www.boost.org/users/history/version_1_67_0.html) implies that
the conversion function should return the maximum value a uint64_t can
hold if the original value is too large.
The idea of using a & with ~0 is a suggestion from the boost release
notes.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Update current results dictionary using the Metric.discover method.
New results are added and missing results are marked as absent.
(Both full metrics or specific keys)
Previously, with prometheous, each metric.update called query_list
resulting in O(n^2) when all metric were updated, like in the scylla_top
dtest - causing test timeout when testing debug build.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Commit log replay was bypassing memtable space back-pressure, and if
replay was faster than memtable flush, it could lead to OOM.
The fix is to call database::apply_in_memory() instead of
table::apply(). The former blocks when memtable space is full.
Fixes#4982.
Tests:
- unit (release)
- manual, replay with memtable flush failin and without failing
Message-Id: <1568381952-26256-1-git-send-email-tgrabiec@scylladb.com>
If the user supplies the 'replication_factor' to the 'NetworkTopologyStrategy' class,
it will expand into a replication factor for each existing DC for their convenience.
Resolves#4210.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
This reverts commit 7f64a6ec4b.
Fixes#5011
The reverted commit exposes #3760 for all schemas, not only those
which have UDTs.
The problem is that table schema deserialization now requires keyspace
to be present. If the replica hasn't received schema changes which
introduce the keyspace yet, the write will fail.
Mention on the top-level README.md that Scylla by default is compatible
with Cassandra, but also has experimental support for DynamoDB's API.
Provide links to alternator/alternator.md and alternator/getting-started.md
with more information about this feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190911080913.10141-1-nyh@scylladb.com>
"
In this patch set, written by Piotr Sarna and myself, we add Alternator - a new
Scylla feature adding compatibility with the API of Amazon DynamoDB(TM).
DynamoDB's API uses JSON-encoded requests and responses which are sent over
an HTTP or HTTPS transport. It is described in detail on Amazon's site:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/
Our goal is that any application written to use Amazon DynamoDB could
be run, unmodified, against Scylla with Alternator enabled. However, at this
stage the Alternator implementation is incomplete, and some of DynamoDB's
API features are not yet supported. The extent of Alternator's compatibility
with DynamoDB is described in the document docs/alternator/alternator.md
included in this patch set. The same document also describes Alternator's
design (and also points to a longer design document).
By default, Scylla continues to listen only to Cassandra API requests and not
DynamoDB API requests. To enable DynamoDB-API compatibility, you must set
the alternator-port configuration option (via command line or YAML) to the port on
which you wish to listen for DynamoDB API requests. For more information, see
docs/alternator/alternator.md. The document docs/alternator/getting-started.md
also contains some examples of how to get started with Alternator.
"
* 'alternator' of https://github.com/nyh/scylla: (272 commits)
Added comments about DAX, monitoring and more
alternator: fix usage of client_state
alternator-test: complete test_expected.py for rest of comparison operators
alternator-test: reproduce bug in Expected with EQ of set value
alternator: implement the Expected request parameter
alternator: add returning PAY_PER_REQUEST billing mode
alternator: update docs/alternator.md on GSI/LSI situation
Alternator: Add getting started document for alternator
move alternator.md to its own directory
alternator-test: add xfail test for GSI with 2 regular columns
alternator/executor.cc: Latencies should use steady_clock
alternator-test: fix LSI tests
alternator-test: fix test_describe_endpoints.py for AWS run
alternator-test: test_describe_endpoints.py without configuring AWS
alternator: run local tests without configuring AWS
alternator-test: add LSI tests
alternator-test: bump create table time limit to 200s
alternator: add basic LSI support
alternator: rename reserved column name "attrs"
alternator: migrate make_map_element_restriction to string view
...
This patch adds tests for all the missing comparion operators in the
Expected parameter (the old-style parameter for conditional operations).
All these new tests are now xfailing on Alternator (and succeeding on
DynamoDB), because these operators are not yet implemented in Alternator
(we only implemented EQ and BEGINS_WITH, so far - the rest are easy but
need to be implemented).
The test_expected.py is now hopefully comprehensive, covering the entire
feature set of the "Expected" parameter and all its various cases and
subcases.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190910092208.23461-1-nyh@scylladb.com>
Our implementation of the "EQ" operator in Expected (conditional
operation) just compares the JSON represntation of the values.
This is almost always correct, but unfortunately incorrect for
sets - where we can have two equal sets despite having a
different order.
This patch just adds an (xfailing) test for this bug.
The bug itself can be fixed in the future in one of several ways
including changing the implementation of EQ, or changing the
serialization of sets so they'll always be sorted in the same
way.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190909125147.16484-1-nyh@scylladb.com>
In this patch we implement the Expected parameter for the UpdateItem,
PutItem and DeleteItem operations. This parameter allows a conditional
update - i.e., do an update only if the existing value of the item
matches some condition.
This is the older form of conditional updates, but is still used by many
applications, including Amazon's Tic-Tac-Toe demo.
As usual, we do not yet provide isolation guarantees for read-modify-write
operations - the item is simply read before the modification, and there is
no protection against concurrent operation. This will of course need to be
addressed in the future.
The Expected parameter has a relatively large number of variations, and most
of them are supported by this code, except that currenly only two comparison
operators are supported (EQ and BEGINS_WITH) out of the 13 listed in the
documentation. The rest will be implemented later.
This patch also includes comprehensive tests for the Expected feature.
These tests are almost exhaustive, except for one missing part (labled FIXME) -
among the 13 comparison operations, the tests only check the EQ and BEGINS_WITH
operators. We'll later need to add checks to the rest of them as well.
As usual, all the tests pass on Amazon DynamoDB, and after this patch all
of them succeed on Alternator too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190905125558.29133-1-nyh@scylladb.com>
In order for Spark jobs to work correctly, a hardcoded PAY_PER_REQUEST
billing mode entry is returned when describing a table with
a DescribeTable request.
Also, one test case in test_describe_table.py is no longer marked XFAIL.
Message-Id: <a4e6d02788d8be48b389045e6ff8c1628240197c.1567688894.git.sarna@scylladb.com>
This patch adds a getting started document for alternator,
it explains how to start up a cluster that has an alternator
API port open and how to test that it works using either an
application or some simple and minimal python scripts.
The goal of the document is to get a user to have an up and
running docker based cluster with alternator support in the
shortest time possible.
As part of trying to make alternator more accessible
to users, we expect more documents to be created so
it seems like a good idea to give all of the alternator
docs their own directory.
When updating the second regular base column that is also a view
key, the code in Scylla will assume it only needs to update an entry
instead of replacing an old one. This leads to inconsitencies
exposed in the test case.
Message-Id: <5dfeb9f61f986daa6e480e9da4c7aabb5a09a4ec.1567599461.git.sarna@scylladb.com>
LSI tests are amended, so they no longer needlessly XPASS:
* two xpassing tests are no longer marked XFAIL
* there's an additional test for partial projection
that succeeds on DynamoDB and does not work fine yet in alternator
Message-Id: <0418186cb6c8a91de84837ffef9ac0947ea4e3d3.1567585915.git.sarna@scylladb.com>
The previous patch fixed test_describe_endpoints.py for a local run
without an AWS configuration. But when running with "--aws", we do
need to use that AWS configuration, and this patch fixes this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Even when running against a local Alternator, Boto3 wants to know the
region name, and AWS credentials, even though they aren't actually needed.
For a local run, we can supply garbage values for these settings, to
allow a user who never configured AWS to run tests locally.
Running against "--aws" will, of course, still require the user to
configure AWS.
The previous patch already fixed this for most tests, this patch fixes the
same issue in test_describe_endpoints.py, which had a separate copy of the
problematic code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Even when running against a local Alternator, Boto3 wants to know the
region name, and AWS credentials, even though they aren't actually needed.
For a local run, we can supply garbage values for these settings, to
allow a user who never configured AWS to run tests locally.
Running against "--aws" will, of course, still require the user to
configure AWS.
Also modified the README to be clearer, and more focused on the local
runs.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708121420.7485-1-nyh@scylladb.com>
Unfortunately the previous 100s limit proved to be not enough
for creating tables with both local and global indexes attached
to them. Empirically 200s was chosen as a safe default,
as the longest test oscillated around 100s with the deviation of 10s.
With this patch, LocalSecondaryIndexes can be added to a table
during its creation. The implementation is heavily shared
with GlobalSecondaryIndexes and as such suffers from the same TODOs:
projections, describing more details in DescribeTable, etc.
We currently reserve the column name "attrs" for a map of attributes,
so the user is not allowed to use this name as a name of a key.
We plan to lift this reservation in a future patch, but until we do,
let's at least choose a more obscure name to forbid - in this patch ":attrs".
It is even less likely that a user will want to use this specific name
as a column name.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190903133508.2033-1-nyh@scylladb.com>
Currently, we reserve the name ATTRS_COLUMN_NAME ("attrs") - the user
cannot use it as a key column name (key of the base table or GSI or LSI)
because we use this name for the attribute map we add to the schema.
Currently, if the user does attempt to create such a key column, the
result is undefined (sometimes corrupt sstables, sometimes outright crashes).
This patches fixes it to become a clean error, saying that this column name is
currently reserved.
The test test_create_table_special_column_name now cleanly fails, instead
of crashing Scylla, so it is converted from "skip" to "xfail".
Eventually we need to solve this issue completely (e.g., in rare cases
rename columns to allow us to reserve a name like ATTRS_COLUMN_NAME,
or alternatively, instead of using a fixed name ATTRS_COLUMN_NAME pick a
different one different from the key column names). But until we do,
better fail with a clear error instead of a crash.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190901102832.7452-1-nyh@scylladb.com>
The file initially consists of a very simple case that succeeds
with `--aws` and expectedly fails without it, because the expression
is not implemented yet.
This adds a "alternator-address" and "alternator-port" configuration
options to the Docker image, so people can enable Alternator with
"docker run" with:
docker run --name some-scylla -d <image> --alternator-port=8080
Message-Id: <20190902110920.19269-1-penberg@scylladb.com>
When an unsupported expression parameter is encountered -
KeyConditionExpression, ConditionExpression or FilterExpression
are such - alternator will return an error instead of ignoring
the parameter.
This patch make two chagnes to the alternator stats:
1. It add estimated_histogram for the get, put, update and delete
operation
2. It changes the metrics naming, so the operation will be a label, it
will be easier to handle, perform operation and display in this way.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The test_gsi_3, involving creating a GSI with two key columns which weren't
previously a base key, now passes, so drop the "xfail" marker.
We still have problems with such materialized views, but not in the simple
scenario tested by test_gsi_3.
Later we should create a new test for the scenario which still fails, if
any.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Creating an underlying materialized view with 2 regular base columns
is risky in Scylla, as second's column liveness will not be correctly
taken into account when ensuring view row liveness.
Still, in case specific conditions are met:
* the regular base column value is always present in the base row
* no TTLs are involved
then the materialized view will behave as expected.
Creating a GSI with 2 base regular columns issues a warning,
as it should be performed with care.
Message-Id: <5ce8642c1576529d43ea05e5c4bab64d122df829.1567159633.git.sarna@scylladb.com>
It is important that BillingMode should default to PROVISIONED, as it
does on DynamoDB. This allows old clients, which don't specify
BillingMode at all, to specify ProvisionedThroughput as allowed with
PROVISIONED.
Also added a test case for this case (where BillingMode is absent).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829193027.7982-1-nyh@scylladb.com>
When querying on a missing index, DynamoDB returns different errors in
case the entire table is missing (ResourceNotFoundException) or the table
exists and just the index is missing (ValidationException). We didn't
make this distinction, and always returned ValidationException, but this
confuses clients that expect ResourceNotFoundException - e.g., Amazon's
Tic-Tac-Toe demo.
This patch adds a test for the first case (the completely missing table) -
we already had a test for the second case - and returns the correct
error codes. As usual the test passes against DynamoDB as well as Alternator,
ensure they behave the same.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829174113.5558-1-nyh@scylladb.com>
We needlessly split the trace-level log message for the request to two
messages - one containing just the operation's name, and one with the
parameters. Moreover we printed them in the opposite order (parameters
first, then the operation). So this patch combines them into one log
message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829165341.3600-1-nyh@scylladb.com>
Alternator puts in the Scylla table a column called "attrs" for all the
non-key attributes. If the user happens to choose the same name, "attrs",
for one of the key columns, the result of writing two different columns
with the same name is a mess and corrupt sstables.
This test reproduces this bug (and works against DynamoDB of course).
Because the test doesn't cleanly fail, but rather leaves Scylla in a bad
state from which it can't fully recover, the test is marked as "skip"
until we fix this bug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190828135644.23248-1-nyh@scylladb.com>
Updating key columns is not allowed in UpdateItem requests,
but the series introducing GSI support for regular columns
also introduced redundant duplicates checks of this kind.
This condition is already checked in resolve_update_path helper function
and existing test_update_expression_cannot_modify_key test makes sure that
the condition is checked.
Message-Id: <00f83ab631f93b263003fb09cd7b055bee1565cd.1567086111.git.sarna@scylladb.com>
The test test_update_expression_cannot_modify_key() verifies that an
update expression cannot modify one of the key columns. The existing
test only tried the SET and REMOVE actions - this patch makes the
test more complete by also testing the ADD and DELETE actions.
This patch also makes the expected exception more picky - we now
expect that the exception message contains the word "key" (as it,
indeed, does on both DynamoDB and Alternator). If we get any other
exception, there may be a problem.
The test passed before this patch, and passes now as well - it's just
stricter now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829135650.30928-1-nyh@scylladb.com>
The code previously used clustering_key::from_singular() to compute
a clustering key value. It works fine, but has two issues:
1. involves one redundant deserialization stage compared to
from_single_value
2. does not work with compound clustering keys, which can appear
when using indexes
With more GSI features implemented, tests with XPASS status are promoted
to being enabled.
One test case (test_gsi_describe) is partially done as DescribeTable
now contains index names, but we could try providing more attributes
(e.g. IndexSizeBytes and ItemCount from the test case), so the test
is left in the XFAIL state.
The DescribeTable request now contains the list of index names
as well. None of the attributes of the list are marked as 'required'
in the documentation, so currently the implementation provides
index names only.
In order to be able to create a Global Secondary Index over a regular
column, this column is upgraded from being a map entry to being a full
member of the schema. As such, it's possible to use this column
definition in the underlying materialized view's key.
In order to prepare alternator for adding regular columns to schema,
i.e. in order to create a materialized view over them,
the code is changed so that updating no longer assumes that only keys
are included in the table schema.
Since in the future we may want to have more regular columns
in alternator tables' schemas, the code is changed accordingly,
so all regular columns will be fetched instead of just the attribute
map.
If no regular column attributes are passed to PutItem, the attr
collector serializes an empty collection mutation nonetheless
and sends it. It's redundant, so instead, if the attr colector
is empty, the collection does not get serialized and sent to replicas.
Keeping an instance of client_state is a convenient way of being able
to use tracing for alternator. It's also currently used in paging,
so adding a client state to executor removes the need of keeping
a dummy value.
String views used in JSON serialization should use not only the pointer
returned by rapidjson, but also the string length, as it may contain
\0 characters.
Additionally, one unnecessary copy is elided.
Add a link to a longer document (currently, around 40 pages) about
DynamoDB's features and how we implemented or may implement them in
Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190825121201.31747-2-nyh@scylladb.com>
If a user tries to create a table with a unsupported feature -
a local secondary index, a used-defined encryption key or supporting
streams (CDC), let's refuse the table creation, so the application
doesn't continue thinking this feature is available to it.
The "Tags" feature is also not supported, but it is more harmless
(it is used mostly for accounting purposes) so we do not fail the
table creation because of it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190818125528.9091-1-nyh@scylladb.com>
In CQL, before a user can create a table, they must create a keyspace to
contain this table and, among other things, specify this keyspace's RF.
But in the DynamoDB API, there is no "create keyspace" operation - the
user just creates a table, and there is no way, and no opportunity,
to specify the requested RF. Presumably, Amazon always uses the same
RF for all tables, most likely 3, although this is not officially
documented anywhere.
The existing code creates the keyspace during Scylla boot, with RF=1.
This RF=1 always works, and is a good choice for a one-node test run,
but was a really bad choice for a real cluster with multiple nodes, so
this patch fixes this choice:
With this patch, the keyspace creation is delayed - it doesn't happen
when the first node of the cluster boots, but only when the user creates
the first table. Presumably, at that time, the cluster is already up,
so at that point we can make the obvious choice automatically: a one-node
cluster will get RF=1, a >=3 node cluster will get RF=3. The choice of
RF is logged - and the choice of RF=1 is considered a warning.
Note that with this patch, keyspace creation is still automatic as it
was before. The user may manually create the keyspace via CQL, to
override this automatic choice. In the future we may also add additional
keyspace configuration options via configuration flags or new REST
requests, and the keyspace management code will also likely change
as we start to support clusters with multiple regions and global
tables. But for now, I think the automatic method is easiest for
users who want to test-drive Alternator without reading lengthy
instructions on how to set up the keyspace.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190820180610.5341-1-nyh@scylladb.com>
We allow BillingMode to be set to either PAY_PER_REQUEST (the default)
or PROVISIONED, although neither mode is fully implemented: In the former
case the payment isn't accounted, and in the latter case the throughput
limits are not enforced.
But other settings for BillingMode are now refused, and we add a new test
to verify that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190818122919.8431-1-nyh@scylladb.com>
The alternator tests want to exercise many of the DynamoDB API features,
so they need a recent enough version of the client libraries, boto3
and botocore. In particular, only in botocore 1.12.54, released a year
ago, was support for BillingMode added - and we rely on this to create
pay-per-request tables for our tests.
Instead of letting the user run with an old version of this library and
get dozens of mysterious errors, in this patch we add a test to conftest.py
which cleanly aborts the test if the libraries aren't new enough, and
recommends a "pip" command to upgrade these libraries.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190819121831.26101-1-nyh@scylladb.com>
The DescribeTable operation was currently implemented to return the
minimal information that libraries and applications usually need from
it, namely verifying that some table exists. However, this operation
is actually supposed to return a lot more information fields (e.g.,
the size of the table, its creation date, and more) which we currently
don't return.
This patch adds a new test file, test_describe_table.py, testing all
these additional attributes that DescribeTable is supposed to return.
Several of the tests are marked xfail (expected to fail) because we
did not implement these attributes yet.
The test is exhaustive except for attributes that have to do with four
major features which will be tested together with these features: GSI,
LSI, streams (CDC), and backup/restore.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190816132546.2764-1-nyh@scylladb.com>
Currently Alternator starts all Scylla requests (including both reads
and writes) without any timeout set. Because of bugs and/or network
problems, Requests can theoretically hang and waste Scylla request for
hours, long after the client has given up on them and closed their
connection.
The DynamoDB protocol doesn't let a user specify which timeout to use,
so we should just use something "reasonable", in this patch 10 seconds.
Remember that all DynamoDB read and write requests are small (even scans
just scan a small piece), so 10 seconds should be above and beyond
anything we actually expect to see in practice.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190812105132.18651-1-nyh@scylladb.com>
So far we had the "--alternator-port" option allowing to configure the port
on which the Alternator server listens on, but the server always listened
to any address. It is important to also be able to configure the listen
address - it is useful in tests running several instances of Scylla on
the same machine, and useful in multi-homed machines with several interfaces.
So this patch adds the "--alternator-address" option, defaulting to 0.0.0.0
(to listen on all interfaces). It works like the many other "--*-address"
options that Scylla already has.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190808204641.28648-1-nyh@scylladb.com>
It turns out that recent rjson patches introduced some buggy
tabs instead of spaces due to bad IDE configuration. The indentation
is restored to spaces.
Until now, filtering in alternator was possible only for non-key
column equality relations. This commit adds support for equality
relations for key columns.
Alternator allows passing hash and sort key restrictions
as filters - it is, however, better to incorporate these restrictions
directly into partition and clustering ranges, if possible.
It's also necessary, as optimizations inside restrictions_filter
assume that it will not be fed unneeded rows - e.g. if filtering
is not needed on partition key restrictions, they will not be checked.
Currently the only utility function for getting key bytes
from JSON was to parse a document with the following format:
"key_column_name" : { "key_column_type" : VALUE }.
However, it's also useful to parse only the inner document, i.e.:
{ "key_column_type" : VALUE }.
Three metrics related to filtering are added to alternator:
- total rows read during filtering operations
- rows read and matched by filtering
- rows read and dropped by filtering
Some underlying operations (e.g. paging) make use of cql_stats
structure from CQL3. As such, cql_stats structure is added
to alternator stats in order to gather and use these statistics.
Read-before-write stat counters were already introduced, but the metrics
needs to be added to a metric group as well in order to be available
for users.
This patch adds partial support for GSI (Global Secondary Index) in
Alternator, implemented using a materialized view in Scylla.
This initial version only supports the specific cases of the index indexing
a column which was already part of the base table's key - e.g., indexing
what used to be a sort key (clustering key) in the base table. Indexing
of non-key attributes (which today live in a map) is not yet supported in
this version.
Creation of a table with GSIs is supported, and so is deleting the table.
UpdateTable which adds a GSI to an existing table is not yet supported.
Query and Scan operations on the index are supported.
DescribeTable does not yet list the GSIs as it should.
Seven previously-failing tests now pass, so their "xfail" tag is removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190808090256.12374-1-nyh@scylladb.com>
The rapidjson library needs to be used with caution in order to
provide maximum performance and avoid undefined behavior.
Comments added to rjson.hh describe provided methods and potential
pitfalls to avoid.
Message-Id: <ba94eda81c8dd2f772e1d336b36cae62d39ed7e1.1565270214.git.sarna@scylladb.com>
With libjsoncpp we were forced to work around the problem of
non-noexcept constructors by using an intermediate unique pointer.
Objects provided by rapidjson have correct noexcept specifiers,
so the workaround can be dropped.
Profiling alternator implied that JSON parsing takes up a fair amount
of CPU, and as such should be optimized. libjsoncpp is a standard
library for handling JSON objects, but it also proves slower than
rapidjson, which is hereby used instead.
The results indicated that libjsoncpp used roughly 30% of CPU
for a single-shard alternator instance under stress, while rapidjson
dropped that usage to 18% without optimizations.
Future optimizations should include eliding object copying, string copying
and perhaps experimenting with different JSON allocators.
Migrating from libjsoncpp to rapidjson proved to be beneficial
for parsing performance. As a first step, a set of helper functions
is provided to ease the migration process.
error.hh file implicitly assumed that seastar:: namespace is available
when it's included, which is not always the case. To remedy that,
seastar::httpd namespace is used explicitly.
Our CreateTable handler assumed that the function
migration_manager::announce_new_column_family()
returns a failed future if the table already exists. But in some of
our code branches, this is not the case - the function itself throws
instead of returning a failed future. The solution is to use
seastar::futurize_apply() to handle both possibilities (direct exception
or future holding an exception).
This fixes a failure of the test_table.py::test_create_table_already_exists
test case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This adds a new document, docs/alternator.md, about Alternator.
The scope of this document should be expanded in the future. We begin
here by introducing Alternator and its current compatibility level with
Amazon DynamoDB, but it should later grow to explain the design of Alternator
and how it maps the DynamoDB data model onto Scylla's.
Whether this document should remain a short high-level overview, or a long
and detailed design document, remains an open question.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190805085340.17543-1-nyh@scylladb.com>
The function attrs_type() return a supposedly singleton, but because
it is a seastar::shared_ptr we can't use the same one for multiple
threads, and need to use a separate one per thread.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190804163933.13772-1-nyh@scylladb.com>
The CQL type singletons like utf8_type et al. are separate for separate
shards and cannot be used across shards. So whatever hash tables we use
to find them, also needs to be per-shard. If we fail to do this, we
get errors running the debug build with multiple shards.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190804165904.14204-1-nyh@scylladb.com>
Expand the GSI test suite. The most important new test is
test_gsi_key_not_in_index(), where the index's key includes just one of
the base table's key columns, but not a second one. In this case, the
Scylla implementation will nevertheless need to add the second key column
to the view (as a clustering key), even though it isn't considered a key
column by the DynamoDB API.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190718085606.7763-1-nyh@scylladb.com>
Our ListTables implementation uses get_column_families(), which lists both
base tables and materialized views. We will use materialized views to
implement DynamoDB's secondary indexes, and those should not be listed in
the results of ListTables.
The patch also includes a test for this.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190717133103.26321-2-nyh@scylladb.com>
The list_tables() utility function was used only in test_table.py
but I want to use it elsewhere too (in GSI test) so let's move it
to util.py.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190717133103.26321-1-nyh@scylladb.com>
As in case of set_diff, an exception message in set_sum should include
the user-provided request (ADD) rather than our internal helper function
set_sum.
Although we do not support GSI yet, until now we silently ignored
CreateTable's GSI parameter, and the user wouldn't know the table
wasn't created as intended.
In this patch, GSI is still unsupported, but now CreateTable will
fail with an error message that GSI is not supported.
We need to change some of the tests which test the error path, and
expect an error - but should not consider a table creation error
as the expected error.
After this patch, test_gsi.py still fails all the tests on
Alternator, but much more quickly :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190711161420.18547-1-nyh@scylladb.com>
The test case for adding two sets with common values is added.
This case is a stub, because boto3 transforms the result into a Python
set, which removes duplicates on its own. A proper TODO is left
in order to migrate this case to a lower-level API and check
the returned JSON directly for lack of duplicates.
The Query operation's conditions can be used to search for a particular
hash key or both hash and sort keys - but not any other combinations.
We previously forgot to verify most errors, so in this patch we add
missing verifications - and tests to confirm we fail the query when
DynamoDB does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190711132720.17248-1-nyh@scylladb.com>
Add more tests for GSI - tests that DescribeTable describes the GSI,
and test the case of more than one GSI for a base table.
Unfortunately, creating an empty table with two GSIs routinely takes
on DynamoDB more than a full minute (!), so because we now have a
test with two GSIs, I had to increase the timeout in create_test_table().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190711112911.14703-1-nyh@scylladb.com>
The holds_path() utility function is actually used to check if a value
needs read before write, so its name is changed to more fitting
check_needs_read_before_write.
Alternator currently keeps an item's attributes inside a map, and we
had a serious bug in the way we build mutations for this map:
We didn't know there was a requirement to build this mutation sorted by
the attribute's name. When we neglect to do this sorting, this confuses
Scylla's merging algorithms, which assume collection cells are thus
sorted, and the result can be duplicate cells in a collection, and the
visible effect is a mutation that seems to be ignored - because both
old and new values exist in the collection.
So this patch includes a new helper class, "attribute_collector", which
helps collect attribute updates (put and del) and extract them in correctly
sorted order. This helper class also eliminates some duplication of
arcane code to create collection cells or deletions of collection cells.
This patch includes a simple test that previously failed, and one xfail
test that failed just because of this bug (this was the test that exposed
this bug). Both tests now succeed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190709160858.6316-1-nyh@scylladb.com>
This patch adds what is hopefully an exhaustive test suite for the
global secondary indexing (GSI) feature, and all its various
complications and corner cases of how GSIs can be created, deleted,
named, written, read, and more (the tests are heavily documented to
explain what they are testing).
All these tests pass on DynamoDB, and fail on Alternator, so they are
marked "xfail". As we develop the GSI feature in Alternator piece by
piece, we should make these tests start to pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708160145.13865-1-nyh@scylladb.com>
This adds another test for BatchWriteItem: That if one of the operations is
invalid - e.g., has a wrong key type - the entire batch is rejected, and not
none of its operations are done - even the valid ones.
The test succeeds, because we already handle this case correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190707134610.30613-1-nyh@scylladb.com>
Test an operation like SET #one = #two, where the RHS has a reference
to a name, rather than the name itself. Also verify that DynamoDB
gives an error if ExpressionAttributeNames includes names not needed
by neither left or right hand side of such assignments.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708133311.11843-1-nyh@scylladb.com>
In order to serve update requests that depend on read-before-write,
a proper helper function which fetches the existing item with a given
key from the database is added.
This read-before-write mechanism is not considered safe, because it
provides no linearizability guarantees and offers no synchronization
protection. As such, it should be consider a placeholder that works
fine on a single machine and/or no concurrent access to the same key.
The calculate_value utility function is going to need more context
in order to resolve paths present in the right-hand side of update_item
operators: update_info and schema.
Historically, resolving a path checked for key columns, which are not
allowed to be on the left-hand side of the assignment. However, path
resolving will now also be used for right-hand side, where it should
be allowed to use the key value.
In order to implement read-before-write in the future, calculate_value
now accepts an additional parameter: previous_item. If read-before-write
was performed, previous_item will contain an item for the given key
which already exists in the database at the time of the update.
This patch moves the create_test_table() utility function, which creates
a test table with a unique name, from the fixtures (conftest.py) to
util.py. This will allow reusing this function in tests which need to
create tables but not through the existing fixtures. In particular
we will need to do this for GSI (global secondary index) tests
in the next patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708104438.5830-1-nyh@scylladb.com>
The tests we had for BatchWriteItem's refusal to accept duplicate keys
only used test_table_s, with just a hash key. This patch adds tests
for test_table, i.e., a table with both hash and sort keys - to check
that we check duplicates in that case correctly as well.
Moreover, the expanded tests also verify that although identical
keys are not allowed, keys with just one component (hash or sort key)
the same but the other not the same - are fine.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190705191737.22235-1-nyh@scylladb.com>
Even when running against a local Alternator, Boto3 wants to know the
region name, and AWS credentials, even though they aren't actually needed.
For a local run, we can supply garbage values for these settings, to
allow a user who never configured AWS to run tests locally.
Running against "--aws" will, of course, still require the user to
configure AWS.
Also modified the README to be clearer, and more focused on the local
runs.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708121420.7485-1-nyh@scylladb.com>
For "--aws" tests, use the default region chosen by the user in the
AWS configuration (~/.aws/config or environment variable), instead of
hard-coding "us-east-1".
Patch by Pekka Enberg.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708105852.6313-1-nyh@scylladb.com>
Calculating value represented as 'v1 + v2' or 'v1 - v2' was previously
implemented with a double type, which offers limited precision.
From now on, these computations are based on big_decimal, which
allows returning values without losing precision.
This patch depends on 'add big_decimal arithmetic operators' series.
Message-Id: <f741017fe3d3287fa70618068bdc753bfc903e74.1562318971.git.sarna@scylladb.com>
Move some common utility functions to a common file "util.py"
instead of repeating them in many test files.
The utility functions include random_string(), random_bytes(),
full_scan(), full_query(), and multiset() (the more general
version, which also supports freezing nested dicts).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190705081013.1796-1-nyh@scylladb.com>
The idiomatic way to use an std::variant depending the type holds is to use
std::visit. This modern API makes it unnecessary to write many boiler-plate
functions to test and cast the type of the variant, and makes it impossible
to forget one of the options. So in this patch we throw out the old ways,
and welcome the new.
Thanks to Piotr Sarna for the idea.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190704205625.20300-1-nyh@scylladb.com>
This patch adds to Alternator an implementation of the BatchGetItem
operation, which allows to start a number of GetItem requests in parallel
in a single request.
The implementation is almost complete - the only missing feature is the
ability to ask only for non-top-level attributes in ProjectionExpression.
Everything else should work, and this patch also includes tests which,
as usual, pass on DynamoDB and now also on Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Amazingly, it appears we never tested booting Alternator a second time :-)
Our initialization code creates a new keyspace, and was supposed to ignore
the error if this keyspace already existed - but we thought the error will
come as an exceptional future, which it didn't - it came as a thrown
exception. So we need to change handle_exception() to a try/catch.
With this patch, I can kill Alternator and it will correctly start again.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Operations which take a key as parameter, namely GetItem, UpdateItem,
DeleteItem and BatchWriteItem's DeleteRequest, already fail if the given
key is missing one of the nessary key attributes, or has the wrong types
for them. But they should also fail if the given key has spurious
attributes beyond those actually needed in a key.
So this patch adds this check, and tests to confirm that we do these checks
correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The PutItem operation, and also the PutRequest of BatchWriteItem, are
supposed to completely replace the item - not to merge the new value with
the previous value. We implemented this wrongly - we just wrote the new
item forgetting a tombstone to remove the old item.
So this patch fixes these operations, and adds tests which confirm the
fix (as usual, these tests pass on DynamoDB, failed on Alternator before
this patch, and pass after the patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add support for the DeleteItem operation, which deletes an item.
The basic deletion operation is supported. Still not supported are:
1. Parameters to conditionally delete (ConditionalExpression or Expected)
2. Parameters to return pre-delete content
3. ReturnItemCollectionMetrics (statistics relevant for tables with LSI)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In BatchWriteItem, we currently only support the PutRequest operation.
If a user tries to use DeleteRequest (which we don't support yet), he
will get a bizarre error. Let's test the request type more carefully,
and print a better error message. This will also be the place where
eventually we'll actually implement the DeleteRequest.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds more comprehensive tests for the BatchWriteItem operation,
in a new file batch_test.py. The one test we already had for it was also
moved from test_item.py here.
Some of the test still xfail for two reasons:
1. Support for the DeleteRequest operation of BatchWriteItem is missing.
2. Tests that forbid duplicate keys in the same request are missing.
As usual, all tests succeed on DynamoDB, and hopefully (I tried...)
cover all the BatchWriteItem features.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB has two similar parameters - AttributesToGet and
ProjectionExpression - which are supported by the GetItem, Scan and
Query operations. Until now we supported only the older AttributesToGet,
and this patch adds support to the newer ProjectionExpression.
Besides having a different syntax, the main difference between
AttributesToGet and ProjectionExpression is that the latter also
allows fetching only a specific nested attribute, e.g., a.b[3].c.
We do not support this feature yet, although it would not be
hard to add it: With our current data representation, it means
fetching the top-level attribute 'a', whose value is a JSON, and then
post-filtering it to take out only the '.b[3].c'. We'll do that
later.
This patch also adds more test cases to test_projection_expression.py.
All tests except three which check the nested attributes now pass,
and those three xfail (they succeed on DynamoDB, and fail as expected
on Alternator), reminding us what still needs to be done.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our GetItem, Query and Scan implementations support the AttributesToGet
parameter to fetch only a subset of the attributes, but we don't yet
support the more elaborate ProjectionExpression parameter, which is
similar but has a different syntax and also allows to specify nested
document paths.
This patch adds existive testing of all the ProjectionExpression features.
All these tests pass against DynamoDB, but fail against the current
Alternator so they are marked "xfail". These tests will be helpful for
developing the ProjectionExpression feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The AttributesToGet parameter - saying which attributes to fetch for each
item - is already supported in the GetItem, Query and Scan operations.
However, we only had a test for it for it for Scan. This patch adds
similar tests also for the GetItem and Query operations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Yet another test for overwriting a top-level attribute which contains
a nested document - here, overwriting it by just a string.
This test passes. In the current implementation we don't yet support
updates to specific attribute paths (e.g. a.b[3].c) but we do support
well writing and over-writing top-level attributes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch implements the last (finally!) syntactic feature of the
UpdateExpression - the ability to do SET a=val1+val2 (where, as
before, each of the values can be a reference to a value, an
attribute path, or a function call).
The implementation is not perfect: It adds the values as double-precision
numbers, which can lose precision. So the patch adds a new test which
checks that the precision isn't lost - a test that currently fails
(xfail) on Alternator, but passes on DynamoDB. The pre-existing test
for adding small integer now passes on Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the previous patch we added function-call support in the UpdateExpression
parser. In this patch we add support for one such function - list_append().
This function takes two values, confirms they are lists, and concatenates
them. After this patch only one function remains unimplemented:
if_not_exists().
We also split the test we already had for list_append() into two tests:
One uses only value references (":val") and passes after this patch.
The second test also uses references to other attributes and will only
work after we start supporting read-modify-write.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Until this patch, in update expressions like "SET a = :val", we only
allowed the right-hand-side of the assignment to be a reference to a
value stored in the request - like ":val" in the above example.
But DynamoDB also allows the value to be an attribute path (e.g.,
"a.b[3].c", and can also be a function of a bunch of other values.
This patch adds supports for parsing all these value types.
This patch only adds the correct parsing of these additional types of
values, but they are still not supported: reading existing attributes
(i.e., read-modify-write operations) is still not supported, and
none of the two functions which UpdateExpression needs to support
are supported yet. Nevertheless, the parsing is now correct, and the
the "unknown_function" test starts to pass.
Note that DynamoDB allows the right-hand side of an assignment to be
not only a single value, but also value+value and value-value. This
possibility is not yet supported by the parser and will be added
later.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test cases verify that equality-based filtering on non-key
attributes works fine. It also contains test stubs for key filtering
and non-equality attribute filtering.
Filled test table used to have identical non-key attributes for all
rows. These values are now diversified in order to allow writing
filtering test cases.
Filtering is currently only implemented for the equality operator
on non-key attributes.
Next steps (TODO) involve:
1. Implementing filtering for key restrictions
2. Implementing non-key attribute filtering for operators other than EQ.
It, in turn, may involve introducing 'map value restrictions' notion
to Scylla, since now it only allows equality restrictions on map
values (alternator attributes are currently kept in a CQL map).
3. Implementing FilterExpression in addition to deprecated QueryFilter
Before this patch, we read either an attribute name like "name" or
a reference to one "#name", as one type of token - NAME.
However, while attribute paths indeed can use either one, in some other
contexts - such as a function name - only "name" is allowed, so we
need to distinguish between two types of tokens: NAME and NAMEREF.
While separating those, I noticed that we incorrectly allowed a "#"
followed by *zero* alphanumeric characters to be considered a NAMEREF,
which it shouldn't. In other words, NAMEREF should have ALNUM+, not ALNUM*.
Same for VALREF, which can't be just a ":" with nothing after it.
So this patch fixes these mistakes, and adds tests for them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB complains, and fails an update, if the update contains in
ExpressionAttributeNames or ExpressionAttributeValues names which aren't
used by the expression.
Let's do the same, although sadly this means more work to track which
of the references we've seen and which we haven't.
This patch makes two previously xfail (expected fail) tests become
successful tests on Alternator (they always succeeded against DynamoDB).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing tests in test_update_expression.py thoroughly tested the
UpdateExpression features which we currently support. But tests for
features which Alternator *doesn't* yet support were partial.
In this patch, we add a large number of new tests to
test_update_expression.py aiming to cover ALL the features of
UpdateExpression, regardless of whether we already support it in
Alternator or not. Every single feature and esoteric edge-case I could
discover is covered in these tests - and as far as I know these tests
now cover the *entire* UpdateExpression feature. All the tests succeed
on DynamoDB, and confirm our understanding of what DynamoDB actually does
on all these cases.
After this patch, test_update_expression.py is a whopper, with 752 lines of
code and 37 separate test functions. 23 out of these 37 tests are still
"xfail" - they succeed on DynamoDB but fail on Alternator, because of
several features we are still missing. Those missing features include
direct updates of nested attributes, read-modify-write updates (e.g.,
"SET a=b" or "SET a=a+1"), functions (e.g., "SET a = list_append(a, :val)"),
the ADD and DELETE operations on sets, and various other small missing
pieces.
The benefit of this whopper test is two-fold: First, it will allow us
to test our implementation as we continue to fill it (i.e., "test-
driven development"). Second, all these tested edge cases basically
"reverse engineer" how DynamoDB's expression parser is supposed to work,
and we will need this knowledge to implement the still-missing features of
UpdateExpression.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds an extensive array of tests for UpdateItem's UpdateExpression
support, which was introduced in the previous patch.
The tests include verification of various edge cases of the parser, support
for ":value" and "#name" references, functioning SET and REMOVE operations,
combinations of multiple such operations, and much more.
As usual, all these tests were ran and succeed on DynamoDB, as well as on
Alternator - to confirm Alternator behaves the same as DynamoDB.
There are two tests marked "xfail" (expected to fail), because Alternator
still doesn't support the attribute copy syntax (e.g., "SET a = b",
doing a read-before-write).
There are some additional areas which we don't support - such as the DELETE
and ADD operations or SET with functions - but those areas aren't yet test
in these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For the UpdateItem operation, so far we supported updates via the
AttributeUpdates parameter, specifying which attributes to set or remove
and how. But this parameter is considered deprecated, and DynamoDB supports
a more elaborate way to modify attributes, via an "UpdateExpression".
In the previous patch we added a function to parse such an UpdateExpression,
and in this patch we use the result of this parsing to actually perform
the required updates.
UpdateExpression is only partially supported after this patch. The basic
"SET" and "REMOVE" operations are supported, but various other cases aren't
fully supported and will be fixed in followup patches. The following
patch will add extensive tests to confirm exactly what works correctly
with the new UpdateExpression support.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB protocol is based on JSON, and most DynamoDB requests describe
the operation and its parameters via JSON objects such as maps and lists.
However, in some types of requests an "expression" is passed as a single
string, and we need to parse this string. These cases include:
1. Attribute paths, such as "a[3].b.c", are used in projection
expressions as well as inside other expressions described below.
2. Condition expressions, such as "(NOT (a=b OR c=d)) AND e=f",
used in conditional updates, filters, and other places.
3. Update expressions, such as "SET #a.b = :x, c = :y DELETE d"
This patch introduces the framework to parse these expressions, and
an implementation of parsing update expressions. These update expressions
will be used in the UpdateItem operation in the next patch.
All these expression syntaxes are very simple: Most of them could be
parsed as regular expressions, or at most a simple hand-written lexical
analyzer and recursive-descent parser. Nevertheless, we decided to specify
these parsers in the same ANTLR3 language already used in the Scylla
project for parsing CQL, hopefully making these parsers easier to reason
about, and easier to change if needed - and reducing the amount of boiler-
plate code.
The parsing of update expressions is most complete except that in SET
actions, only the "path = value" form is supported and not yet forms
forms such as "path1 = path2" (which does read-before-write) or
"path1 = path1 + value" or "path = function(...)".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We need to write more tests for various case of handling
nested documents and nested attributes. Let's collect them
all in the same test file.
This patch mostly moves existing code, but also adds one
small test, test_nested_document_attribute_write, which
just writes a nested document and reads it back (it's
mostly covered by the existing test_put_and_get_attribute_types,
but is specifically about a nested document).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We usually run Alternator tests against the local Alternator - testing
against AWS DynamoDB is rarer, and usually just done when writing the
test. So let's make "pytest" without parameters default to testing locally.
To test against AWS, use "pytest --aws" explicitly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Attributes for reads (GetItem, Query, Scan, ...) and writes (PutItem,
UpdateItem, ...) are now serialized and deserialized in binary form
instead of raw JSON, provided that their type is S, B, BOOL or N.
Optimized serialization for the rest of the types will be introduced
as follow-ups.
Message-Id: <6aa9979d5db22ac42be0a835f8ed2931dae208c1.1559646761.git.sarna@scylladb.com>
Attributes used to be written into the database in raw JSON format,
which is far from optimal. This patch introduces more robust
serializationi routines for simple alternator types: S, B, BOOL, N.
Serialization uses the first byte to encode attribute type
and follows with serializing data in binary form.
More complex types (sets, lists, etc.) are currently still
serialized in raw JSON and will be optimized in follow-up patches.
Message-Id: <10955606455bbe9165affb8ac8fba4d9e7c3705f.1559646761.git.sarna@scylladb.com>
For some unknown reason we put the list of alternator source files
in configure.py inside the "api" list. Let's move it into a separate
list.
We could have just put it in the scylla_core list, but that would cause
frequent and annoying patch conflicts when people add alternator source
files and Scylla core source files concurrently.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
So far for UpdateItem we only supported the old-style AttributeUpdates
parameter, not the newer UpdateExpression. This patch begins the path
to supporting UpdateExpression. First, trying to use *both* parameters
should result in an error, and this patch does this (and tests this).
Second, passing neither parameters is allowed, and should result in
an *empty* item being created.
Finally, since today we do not yet support UpdateExpression, this patch
will cause UpdateItem to fail if UpdateExpression is used, instead of
silently being ignored as we did so far.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds two simple tests for nested documents, which pass:
test_nested_document_attribute_overwrite() tests what happens when
we UpdateItem a top-level attribute to a dictionary. We already tested
this works on an empty item in a previous test, but now we check what
happens when the attribute already existed, and already was a dictionary,
and now we update it to a new dictionary. In the test attribute a was
{b:3, c:4} and now we update it to {c:5}. The test verifies that the new
dictionary completely replaces the old one - the two are not merged.
The new value of the attribute is just {c:5}, *not* {b:3, c:5}.
The second test verifies that the AttributeUpdates parameter of
UpdateItem cannot be used to update a just a nested attributes.
Any dots in the attribute name are considered an actual dot - not
part of a path of attribute names.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Comparing two lists of items without regard for order is not trivial.
For this reason some tests in test_query.py only compare arrays of sort
keys, and those tests are fine.
But other tests used a trick of converting a list of items into a
of set_of_frozen_elements() and compare this sets. This trick is almost
correct, but it can miss cases where items repeat.
So in this patch, we replace the set_of_frozen_elements() approach by
a similar one using a multiset (set with repetitions) instead of a set.
A multiset in Python is "collections.Counter". This is the same approach
we started to also used in test_scan.py in a recent patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Remove the incomplete and unused function to convert DynamoDB type names
to ScyllaDB type objects:
DynamoDB has a different set of types relevant for keys and for attributes.
We already have a separate function, parse_key_type(), for parsing key
types, and for attributes - we don't currently parse the type names at
all (we just save them as JSON strings), so the function we removed here
wasn't used, and was in fact #if'ed out. It was never completed, and it now
started to decay (the type for numbers is wrong), so we're better off
completely removing it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch implements a fully working number type for keys, and now
Alternator fully and correctly supports every key type - strings, byte
arrays, and numbers.
The patch also adds a test which verifies that Scylla correctly sorts
number sort keys, and also correctly retrieves them to the full precision
guaranteed by DynamoDB (38 decimal digits).
The implementation uses Scylla's "decimal" type, which supports arbitrary
precision decimal floating point, and in particular supports the precision
specified by DynamoDB. However, "decimal" is actually over-qualified for
this use, so might not be optimal for the more specific requirements of
DynamoDB. So a FIXME is left to optimize this case in the future.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Comparing two lists of items without regard for order is not trivial.
test_scan.py currently has two ways of doing this, both unsatisfactory:
1. We convert each list to a set via set_of_frozen_elements(), and compare
the sets. But this comparison can miss cases where items repeat.
2. We use sorted() on the list. This doesn't work on Python 3 because
it removed the ability to compare (with "<") dictionaries.
So in this patch, we replace both by a new approach, similar to the first
one except we use a multiset (set with repetitions) instead of a set.
A multiset in Python is "collections.Counter".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Creating and deleting tables is the slowest part of our tests,
so we should lower the number of tables our tests create.
We had a "test_2_tables" fixture as a way to create two
tables, but since our tests already create other tables
for testing different key types, it's faster to reuse those
tables - instead of creating two more unused tables.
On my system, a "pytest --local", running all 38 tests
locally, drops from 25 seconds to 20 seconds.
As a bonus, we also have one fewer fixture ;-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
to 1024 bytes, and the entire item to 400 KB which therefore also
limits the size of one attribute. This test checks that we can
reach up to these limits, with binary keys and attributes.
The test does *not* check what happens once we exceed these
limits. In such a case, DynamoDB throws an error (I checked that
manually) but Alternator currently simply succeeds. If in the
future we decide to add artificial limits to Alternator as well,
we should add such tests as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"len" is an unfortunate choice for a variable name, in case one
day the implementation may want to call the built-in "len" function.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have a test for *string* sort-key ordering of items returned
by the Scan operation, and this test adds a similar test for the Query
operation. We verify that items are retrieved in the desired sorted
order (sorted by the aptly-named sort key) and not in creation order
or any other wrong order.
But beyond just checking that Query works as expected (it should,
given it uses the same machinary as Scan), the nice thing about this
test is that it doesn't create a new table - it uses a shared table
and creates one random partition inside it. This makes this test
faster and easier to write (no need for a new fixture), and most
importantly - easily allows us to write similar tests for other
key types.
So this patch also tests the correct ordering of *binary* sort keys.
It helped exposed bugs in previous versions of the binary key implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Simple tests for item operations (PutItem, GetItem) with binary key instead
of string for the hash and sort keys. We need to be able to store such
keys, and then retrieve them correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Until now we only supported string for key columns (hash or sort key).
This patch adds support for the bytes type (a.k.a binary or blob) as well.
The last missing type to be supported in keys is the number type.
Note that in JSON, bytes values are represented with base64 encoding,
so we need to decode them before storing the decoded value, and re-encode
when the user retrieves the value. The decoding is important not just
for saving storage space (the encoding is 4/3 the size of the decoded)
but also for correct *sorting* of the binary keys.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB API uses base64 encoding to encode binary blobs as JSON
strings. So we need functions to do these conversions.
This code was "inspired" by https://github.com/ReneNyffenegger/cpp-base64
but doesn't actually copy code from it.
I didn't write any specific unit tests for this code, but it will be
exercised and tested in a following patch which tests Alternator's use
of these functions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
BEGINS_WITH behaves in a special way when a key postfix
consists of <255> bytes. The initial test does not use that
and instead checks UTF-8 characters, but once bytes type
is implemented for keys, it should also test specifically for
corner cases, like strings that consist of <255> byte only.
Message-Id: <fe10d7addc1c9d095f7a06f908701bb2990ce6fe.1558603189.git.sarna@scylladb.com>
BEGINS_WITH statement increments a string in order to compute
the upper bound for a clustering range of a query.
Unfortunately, previous implementation was not correct,
as it appended a <0> byte if the last character was <255>,
instead of incrementing a last-but-one character.
If the string contains <255> bytes only, the upper bound
of the returned upper bound is infinite.
Message-Id: <3a569f08f61fca66cc4f5d9e09a7188f6daad578.1558524028.git.sarna@scylladb.com>
We had several places in the code that need to parse the
ConsistentRead flag in the request. Let's add a function
that does this, and while at it, checks for more error
cases and also returns LOCAL_QUORUM and LOCAL_ONE instead
of QUORUM and ONE.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As Shlomi suggested in the past, it is more likely that when we
eventually support global tables, we will use LOCAL_QUORUM,
not QUORUM. So let's switch to that now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
So far, all of the tests in test_item.py (for PutItem, GetItem, UpdateItem),
were arbitrarily done on a test table with both hash key and sort key
(both with string type). While this covers most of the code paths, we still
need to verify that the case where there is *not* a sort key, also works
fine. E.g., maybe we have a bug where a missing clustering key is handled
incorrectly or an error is incorrectly reported in that case?
But in this patch we add tests for the hash-key-only case, and see that
it already works correctly. No bug :-)
We add a new fixture test_table_s for creating a test table with just
a single string key. Later we'll probably add more of these test tables
for additional key types.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Another type of key type error can be to forget part of the key
(the hash or sort key). Let's test that too (it already works correctly,
no need to patch the code).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When a table has a hash key or sort key of a certain type (this can
be string, bytes, or number), one cannot try to choose an item using
values of different types.
We previously did not handle this case gracefully, and PutItem handled
it particularly bad - writing malformed data to the sstable and basically
hanging Scylla. In this patch we fix the pk_from_json() and ck_from_json()
functions to verify the expected type, and fail gracefully if the user
sent the wrong type.
This patch also adds tests for these failures, for the GetItem, PutItem,
and UpdateItem operations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
According to the documentation, trying to GetItem a non-existant item
should result in an empty response - NOT a response with an empty "Item"
map as we do before this patch.
This patch fixes this case, and adds a test case for it. As usual,
we verify that the test case also works on Amazon DynamoDB, to verify
DynamoDB really behaves the way we thik it does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If an empty item (i.e., no attributes except the key) is created, or an item
becomes empty (by deleting its existing attributes), the empty item must be
maintained - it cannot just disappear. To do this in Scylla, we must add a
row marker - otherwise an empty attribute map is not enough to keep the
row alive.
This patch includes 4 test cases for all the various ways an empty item can be
created empty or non-empty item be emptied, and verifies that the empty item
can be correctly retrieved (as usual, to verify that our expectation of
"correctness" is indeed correct, we run the same tests against DynamoDB).
All these 4 tests failed before this patch, and now succeed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
These lines of codes were superfluous and their result unused: the
make_item_mutation() function finds the pk and ck on its own.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
his patch adds a statistics framework to Alternator: Executor has (for
each shard) a _stats object which contains counters for various events,
and also is in charge of making these counters visible via Scylla's regular
metrics API (http://localhost:9180/metrics).
This patch includes a counter for each of DynamoDB's operation types,
and we increase the ones we support when handled. We also added counters
for total operations and unsupported operations (operation types we don't
yet handle). In the future we can easily add many more counters: Define
the counter in stats.hh, export it in stats.cc, and increment it in
where relevant in executor.cc (or server.cc).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Ask to retrieve only an attribute name which *none* of the items have.
The result should be a silly list of empty items, and indeed it is.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Use full_scan() in another test instead of open-coding the scan.
There are two more tests that could have used full_scan(), but
since they seem to be specifically adding more assertions or
using a different API ("paginators"), I decided to leave them
as-is. But new tests should use full_scan().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is a short, but extensive, test to the AttributesToGet parameter
to Scan, allowing to select for output only some of the attributes.
The AttributesToGet feature has several non-obvious features. Firstly,
it doesn't require that any key attributes be selected. So since each
item may have different non-key attributes, some scanned items may
be missing some of the selected columns, and some of the items may
even be missing *all* the selected columns - in which case DynamoDB
returns an empty item (and doesn't entirely skip this item). This
test covers all these cases, and it adds yet another item to the
'filled_test_table' fixture, one which has different attributes,
so we can see these issues.
As usual, this test passes in both DynamoDB and Alternator, to
assure we correspond to the *right* behavior, not just what we
think is right.
This test actually exposed a bug in the way our code returned
empty items (items which had none of the selected columns),
a bug which was fixed by the previous patch.
Instead of having yet another copy of table-scanning code, this
patch adds a utility function full_scan(), to scan an entire
table (with optional extra parameters for the scan) and return
the result as an array. We should simply existing tests in
test_scan.py by using this new function.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* tag 'dbuild-image-help-usage-v1' of github.com:bhalevy/scylla:
dbuild: add usage
dbuild: add help option
dbuild: list available images when no image arg is given
dbuild: add --image option
When a Scan selects only certain attributes, and none of the key
attributes are selected, for some of the scanned items *nothing*
will remain to be output, but still Dynamo outputs an empty item
in this case. Our code had a bug where after each item we "moved"
the object leaving behind a null object, not an empty map, so a
completely empty item wasn't output as an empty map as expected,
and resulted in boto3 failing to parse the response.
This simple one-line patch fixes the bug, by resetting the item
to an empty map after moving it out.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Instead of blindly returning "localhost:8000" in response to
DescribeEndpoints and for sure causing us problems in the future,
the right thing to do is to return the same domain name which the
user originally used to get to us, be it "localhost:8000" or
"some.domain.name:1234". But how can we know what this domain name
was? Easy - this is why HTTP 1.1 added a mandatory "Host:" header,
and the DynamoDB driver I tested (boto3) adds it as expected,
indeed with the expected value of "localhost:8000" on my local setup.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although different partitions are returned by a Scan in (seemingly)
random order, items in a single partition need to be returned sorted
by their sort key. This adds a test to verify this.
This patch adds to the filled_test_table fixture, which until now
had just one item in each partition, another partition (with the key
"long") with 164 additional items. The test_scan_sort_order_string
test then scans this table, and verifies that the items are really
returned in sorted order.
The sort order is, of course, string order. So we have the first
item with sort key "1", then "10", then "100", then "101", "102",
etc. When we implement numeric keys we'll need to add a version
of this test which uses a numeric clustering key and verifies the
sort order is numeric.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Because of a typo, we incorrectly set the table's sort key as a second
partition key column instead of a clustering key column. This has bad
but subtle consequences - such as that the items are *not* sorted
according to the sort key. So in this patch we fix the typo.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DescribeEndpoints is not a very important API (and by default, clients
don't use it) but I wanted to understand how DynamoDB responds to it,
and what better way than to write a test :-)
And then, if we already have a test, let's implement this request in
Scylla as well. This is a silly implementation, which always returns
"localhost:8000". In the future, this will need to be configurable -
we're not supposed here to return *this* server's IP address, but rather
a domain name which can be used to get to all servers.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
Currently, GDB scripts locate sstables by scanning the heap for
bag_sstable_set containers. That has disadvatanges:
- not all containers are considered
- it's extremely slow on large heaps
- fragile, new containers can be added, and we won't even know
This series fixes all above by adding a per-shard sstable tracker
which tracks sstable objects in a linked-list.
"
* 'sstable-tracker' of github.com:tgrabiec/scylla:
gdb: Use sstable tracker to get the list of sstables
gdb: Make intrusive_list recognize member_hook links
sstables: Track whether sstable was already open or not
sstables: Track all instances of sstable objects
sstables: Make sstable object not movable
sstables: Move constructor out of line
Most of the request types need to a TableName parameter, specifying the
name of the table they operate on. There's a lot of boilerplate code
required to get this table name and verify that it is valid (the parameter
exists, is a string, passes DynamoDB's naming rules, and the table
actually exists), which resulted in a lot of code duplication - and
in some cases missing checks.
So this patch introduces two utility functions, get_table_name()
and get_table(), to fetch a table name or the schema of an existing
table, from the request, with all necessary validation. If validation
fails, the appropriate api_error() is thrown so the user gets the
right error message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Remove unused random-string code from conftest.py, and also add a
TODO comment how we should speed up filled_test_table fixture by
using a batch write - when that becomes available in Alternator.
(right now this fixture takes almost 4 seconds to prepare on a local
Alternator, and a whopping 3 minutes (!) to prepare on DynamoDB).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_put_and_get_attribute_types needlessly named all the
different attributes and their variables, causing a lot of repetition
and chance for mistakes when adding additional attributes to the test.
In this rewrite, we only have a list of items, and automatically build
attributes with them as values (using sequential names for the attributes)
and check we read back the same item (Python's dict equality operator
checks the equality recursively, as expected).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although we planned to initially support only string types, it turns out
for the attributes (*not* the key), we actually support all types already,
including all scalar types (string, number, bool, binary and null) and
more complex types (list, nested document, and sets).
This adds a tests which PutItem's these types and verifies that we can
retrieve them.
Note that this test deals with top-level attributes only. There is no
attempt to modify only a nested attribute (and with the current code,
it wouldn't work).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In our tests, we cannot really assume that ListTables should returns *only*
the tables we created for the test, or even that a page size of 100 will
be enough to list our 3 pages. The issue is that on a shared DynamoDB, or
in hypothetical cases where multiple tests are run in parallel, or previous
tests had catestrophic errors and failed to clean up, we have no idea how
many unrelated tables there are in the system. There may be hundreds of
them. So every ListTables test will need to use paging.
So in this re-implementation, we begin with a list_tables() utility function
which calls ListTables multiple times to fetch all tables, and return the
resulting list (we assume this list isn't so huge it becomes unreasonable
to hold it in memory). We then use this utility function to fetch the table
list with various page sizes, and check that the test tables we created are
listed in the resulting list.
There's no longer a separate test for "all" tables (really was a page of 100
tables) and smaller pages (1,2,3,4) - we now have just one test that does the
page sizes 1,2,3,4, 50 and 100.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch cleans up some comments and reorganizes some functions in
conftest.py, where the test_table fixture was defined. The goal is to
later add additional types of test tables with different schemas (e.g.,
just a partition key, different key types, etc.) without too much
code duplication.
This patch doesn't change anything functional in the tests, and they
still pass ("pytest --local" runs all tests against the local Alternator).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The ck_from_json() utility function is easier to use if it handles
the no-clustering-key case as the callers need them too, instead of
requiring them to handle the no-clustering-key case separately.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
So far we supported UpdateItem only with PUT operations - this patch
adds support for DELETE operations, to delete specific attributes from
an item.
Only the case of a missing value is support. DynamoDB also provides
the ability to pass the old value, and only perform the deletion if
the value and/or its type is still up-to-date - but we don't support
this yet and fail such request if it is attempted.
This patch also includes a test for this case in alternator-test/
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add initial tests for UpdateItem. Only the features currently supported
by our code (only string attributes, only "PUT" action) are tested.
As usual, this test (like all others) was tested to pass on both DynamoDB
and Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add an initial UpdateItem implementation. As PutItem and GetItem we
are still limited to string attributes. This initial implementation
of UpdateItem implements only the "PUT" action (not "DELETE" and
certainly not "ADD") and not any of the more advanced options.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All operation-generated error messages should have the 400 HTTP error
code. It's a real nag to have to type it every time. So make it the
default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Without special options, PutItem should return nothing (an empty
JSON result). Previously we had trouble doing this, because instead
of return an empty JSON result, we converted an empty string into
JSON :-) So the existing code had an ugly workaround which worked,
sort of, for the Python driver but not for the Java driver.
The correct fix, in this patch, is to invent a new type json_string
which is a string *already* in JSON and doesn't need further conversion,
so we can use it to return the empty result. PutItem now works from
YCSB's Java driver.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although we would like to allow table names up to 222 bytes, this is not
currently possible because Scylla tacks additional 33 bytes to create
a directory name, and directory names are limited to 255 bytes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The supported key types are just S(tring), B(lob), or N(umber).
Other types are valid for attributes, but not for keys, and should
not be accepted. And wrong types used should result in the appropriate
user-visible error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To be correct, CreateTable's input parsing need to work in reverse from
what it did: First, the key columns are listed in KeySchema, and then
each of these (and potetially more, e.g., from indexes) need to appear
AttributeDefinitions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Without any arguments, PutItem should return no data at all. But somehow,
for reasons I don't understand, the boto3 driver gets confused from an
empty JSON thinking it isn't JSON at all. If we return a structure with
an empty "attributes" fields, boto3 is happy.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add an initial implementation of Delete table, enough for making the
pytest --local test_table.py::test_create_and_delete_table
Pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When given an unknown operation (we didn't implement yet many of them...)
we should throw the appropriate api_error, not some random exception.
This allows the client to understand the operation is not supported
and stop retrying - instead of retrying thinking this was a weird
internal error.
For example the test
pytest --local test_table.py::test_create_and_delete_table
Now fails immediately, saying Unsupported operation DeleteTable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The structure's name in DescribeTable's output is supposed to be called
"Table", not "TableDescription". Putting in the wrong place caused the
driver's table creation waiters to fail.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
validate table name in CreateTable, and if it doesn't fit DynamoDB's
requirement, return the appropriate error as drivers expect.
With this patch, test_table.py::test_create_table_unsupported_names
now passes (albeit with a one minute pause - this a bug with keep-alive
support...).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Check the expected error message to contain just ValidationException
instead of an overly specific text message from DynamoDB, so we aren't
so constraint in our own messages' wording.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Dynamo allows tables names up to 255 characters, but when this is tested on
Alternator, the results are disasterous: mkdir with such a long directory
name fails, Scylla considers this an unrecoverable "I/O error", and exits
the server.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Start to use "test fixtures" defined in conftest.py: The connection to
the DynamoDB API, and also temporary tables, can be reused between multiple
tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This initial implementation is enough to pass a test of getting a
failure for a non-existant table -
test_table.py::test_describe_table_non_existent_table
and to recognize an existing table. But it's still missing a lot
of fields for an existing table (among others, the schema).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Exceptions from the handlers need to be output in a certain way - as
a JSON with specific fields - as DynamoDB drivers expect them to be.
If a handler throws an alternator::api_error with these specific fields,
they are output, but any other exception is converted into the same
format as an "Internal Error".
After this patch, executor code can throw an alternator::api_error and
the client will receive this error in the right format.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB error messages are returned in JSON format and expect specific
information: Some HTTP error code (often but not always 400), a string
error "type" and a user-readable message. Code that wants to return
user-visible exceptions should use this type, and in the next patch we
will translate it to the appropriate JSON string.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "Timestamp" type returned for CreationDateTime can be one of several
things but if it is a number, it is supposed to be the time in *seconds*
since the epoch - not in milliseconds. Returning milliseconds as we
wrongly did causes boto3 (AWS's Python driver) to throw a parse exception
on this response.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Until now, we always opened the Alternator port along with Scylla's
regular ports (CQL etc.). This should really be made optional.
With this patch, by default Alternator does NOT start and does not
open a port. Run Scylla with --alternator-port=8000 to open an Alternator
API port on port 8000, as was the default until now. It's also possible
to set this in scylla.yaml.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The interface works on port 8000 by default and provides
the most basic alternator operations - it's an incomplete
set without validation, meant to allow testing as early as possible.
Some sstable objects correspond to sstables which are being written
and are not sealed yet. Such sstables don't have all the fields
filled-in. Tools which calculate statistics (like GDB scripts) need to
distinguish such sstables.
There is no reason to keep parts of the the Scylla Metadata component in memory
after it is read, parsed, and its information fed into the SSTable.
We have seen systems in which the Scylla metadata component is one
of the heaviest memory users, more than the Summary and Filter.
In particular, we use the token metadata, which is the largest part of the
Scylla component, to calculate a single integer -> the shards that are
responsible for this SSTable. Once we do that, we never use it again
Tests: unit (release/debug), + manual scylla write load + reshard.
Fixes#4951
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Introduce mutation_fragment_stream_validator class and use it as a
Filter to flat_mutation_reader::consume_in_thread from
sstable::write_components to validate partition region and optionally
clustering key monotonicity.
Fixes#4803
key monotonicity validation requires an overhead to store the last key and also to compare
therefore provide an option to enable/disable it (disabled by default).
Refs #4804
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Storing and comparing keys is expensive.
Add a flag to enable/disable this feature (disabled by default).
Without the flag, only the partition region monotonicity is
validated, allowing repeated clustering rows, regardless of
clustering keys.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The respective constructor is explicit.
Define this assignment operator to be used by flat_mutation_reader
mutation_fragment_stream_validator filter so that it can use
mutation_fragment::position() verbatim and keep its internal
state as position_in_partition.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Recently we started to use more the concept of metric labels - several
metrics which share the same name, but differ in the value of some label
such a "group" (for different scheduling groups).
This patch documents this feature in docs/metrics.md, gives the example of
scheduling groups, and explains a couple more relevant Promethueus syntax
tricks.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190909113803.15383-1-nyh@scylladb.com>
* seastar cb7026c16f...b3fb4aaab3 (10):
> Revert "scheduling groups: Adding per scheduling group data support"
> scheduling groups: Adding per scheduling group data support
> rpc: check that two servers are not created with the same streaming id
> future: really ignore exceptions in ignore_ready_future
> iostream: Constify eof() function
> apply.hh: add missing #include for size_t
> scheduling_group_demo: add explicit yields since future::get() no longer does
> Fix buffer size used when calling accept4()
> future-util: reduce allocations and continuations in parallel_for_each
> rpc: lz4_decompressor: Add a static constexpr variable decleration for Cpp14 compatibility
Previously, the gate could get
closed too early, which would result in shutting down the server
before it had an opportunity to respond to the client.
Refs #4818
"
The release notes for boost 1.67.0 includes:
Breaking Change: When converting a multiprecision integer to a narrower type, if the value is too large (or negative) to fit in the smaller type, then the result is either the maximum (or minimum) value of the target
Since we just moved out of boost 1.66, we have to update our code.
This fixes issue #4960
"
* 'espindola/fix-4960' of https://github.com/espindola/scylla:
types: fix varint to integer conversion
types: extract a from_varint_to_integer from make_castas_fctn_from_decimal_to_integer
types: fix decimal to integer conversion
types: extract helper for converting a decimal to a cppint
types: rename and detemplate make_castas_fctn_from_decimal_to_integer
"
With this patch series one has to be explicit to create a date_type_impl and now there is only the one documented difference between date_type_impl and timestamp_type_impl.
"
* 'espindola/simplify-date-type' of https://github.com/espindola/scylla:
types: Reduce duplication around date_type_impl
types: Don't use date_type_native_type when we want a timestamp
types: Remove timestamp_native_type
types: Don't specialize data_type_for for db_clock::time_point
types: Make it harder to create date_type
According to the comments, the only different between date_type_impl
and timestamp_type_impl is the comparison function.
This patch makes that explicit by merging all code paths except:
* The warning when converting between the two
* The compare function
The date_type_impl type can still be user visible via very old
sstables or via the thrift protocol. It is not clear if we still need
to support either, but with this patch it is easy to do so.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
In these cases it is pretty clear that the original code wanted to
create a timestamp_type data_value but was creating a date_type one
because of the old defaults.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Now that we know that anything expecting a date_type has been
converted to date_type_native_type, switch to using
db_clock::time_point when we want a timestamp_type.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This also moves every user to date_type_native_type. A followup patch
will convert to timestamp_type when appropriate.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
date_type was replaced with timestamp_type, but it was very easy to
create a date_type instead of a timestamp_type by accident.
This patch changes the code so that a date_type is no longer
implicitly used when constructing a data_value. All existing code that
was depending on this is converted to explicitly using
date_type_native_type. A followup patch will convert to timestamp_type
when appropriate.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Commit 7e3805ed3d removed the load balancing code from cql
server, but it did not remove most of the craft that load balancing
introduced. The most of the complexity (and probably the main reason the
code never worked properly) is around service::client_state class which
is copied before been passed to the request processor (because in the past
the processing could have happened on another shard) and then merged back
into the "master copy" because a request processing may have changed it.
This commit remove all this copying. The client_request is passed as a
reference all the way to the lowest layer that needs it and it copy
construction is removed to make sure nobody copies it by mistake.
tests: dev, default c-s load of 3 node cluster
Message-Id: <20190906083050.GA21796@scylladb.com>
"
This avoids a double dispatch on _kind and also removes a few shared_ptr copies.
The extra work was a small regression from the recent types refactoring.
"
* 'espindola/optimize_type_find' of https://github.com/espindola/scylla:
types: optimize type find implementation
types: Avoid shared_ptr copies
Currently when an error happens during the receive and distribute phase
it is swallowed and we just return a -1 status to the remote. We only
log errors that happen during responding with the status. This means
that when streaming fails, we only know that something went wrong, but
the node on which the failure happened doesn't log anything.
Fix by also logging errors happening in the receive and distribute
phase. Also mention the phase in which the error happened in both error
log messages.
Refs: #4901
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190903115735.49915-1-bdenes@scylladb.com>
The previous code was using the boost::multiprecision::cpp_int to
integer conversion, but that doesn't have the same semantics an cql
for signed numbers.
This fixes the dtest cql_cast_test.py:CQLCastTest.cast_varint_test.
Fixes#4960
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The previous code was using the boost::multiprecision::cpp_rational to
integer conversion, but that doesn't have the same semantics an cql.
This patch avoids creating a cpp_rational in the first place and works
just with integers.
This fixes the dtest cql_cast_test.py:CQLCastTest.cast_decimal_test.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
on_down() iterates over _view_update_handlers_list, but it yields during iteration,
and while it yields, elements in that list can be removed, resulting in a
use-after-free.
Prevent this by registering iterators that can be potentially invalidated, and
any time we remove an element from the list, check whether we're removing an element
that is being pointed to by a live iterator. If that is the case, advance the iterator
so that it points at a valid element (or at the end of the list).
Fixes#4912.
Tests: unit (dev)
Currently the background schema sync (push/pull) uses frozen mutation to
send the schema mutations over the wire to the remote node. For this to
work correctly, both nodes have to have the exact same schema for the
system schema tables, as attempting to unpack the frozen mutation with
the wrong schema leads to undefined behaviour.
To avoid this and to ensure syncing schema between nodes with different
schema table schema versions is defined we migrate the background
schema sync to use canonical mutations for the transfer of the schema
mutations. Canonical mutations are immune to this problem, as they
support deserializing with any version of the schema, older or newer
one.
The foreground schema sync mechanisms -- the on-demand schema pulls on
reads and writes -- already use canonical mutations to transmit the
schema mutations.
It is important to note that due to this change, column-level
incompatibilities between the schema mutations and the schema used to
deserialize them will be hidden. This is undesired and should be fixed
in a follow-up (#4956). Table level incompatibilities are detected and
schema mutations containing such mutations will be rejected just like before.
This patch adds canonical mutation support to the two background schema
sync verbs:
* `DEFINITIONS_UPDATE` (schema push)
* `MIGRATION_REQUEST` (schema pull)
Both verbs still support the old frozen mutation schema transfer, albeit
that path is now much less efficient. After all nodes are upgraded, the
pull verb can effectively avoid sending frozen mutations altogether,
completely migrating to canonical mutations. Unfortunately this was not
possible for the push verb, so that one now has an overhead as it needs
to send both the frozen and canonical mutations.
Fixes: #4273
The previous code was not exception safe and would eventually cause a
file to be destroyed without being closed, causing an assert failure.
Unfortunately it doesn't seem to be possible to test this without
error injection, since using an invalid directory fails before this
code is executed.
Fixes#4948
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190904002314.79591-1-espindola@scylladb.com>
The verbs are:
* DEFINITIONS_UPDATE (push)
* MIGRATION_REQUEST (pull)
Support was added in a backward-compatible way. The push verb, sends
both the old frozen mutation parameter, and the new optional canonical
mutation parameter. It is expected that new nodes will use the latter,
while old nodes will fall-back to the former. The pull verb has a new
optional `options` parameter, which for now contains a single flag:
`remote_supports_canonical_mutation_retval`. This flag, if set, means
that the remote node supports the new canonical mutation return value,
thus the old frozen mutations return value can be left empty.
In preparation to the schema push/pull migrating to use canonical
mutations, convert the method producing the schema mutations to return a
vector of canonical mutations. The only user, MIGRATION_REQUEST verb,
converts the canonical mutations back to frozen mutations. This is very
inefficient, but this path will only be used in mixed clusters. After
all nodes are upgraded the verb will be sending the canonical mutations
directly instead.
This turns find into a template so there is only one switch over the
kind of each type in the search.
To evaluate the change in code size sizes, I added [[noinline]] to
find and obtained the following results.
The release columns for release in the before case have an extra column
because the functions are sufficiently complex to trigger gcc to split
them in hot + cold.
before:
dev release (hot + cold split)
find 0x35f = 863 0x3d5 + 0x112 = 1255
references_duration 0x62 + 0x22 + 0x8 = 140 0x55 + 0x1f + 0x2a + 0x8 = 166
references_user_type 0x6b + 0x26 + 0x111 = 418 0x65 + 0x1f + 0x32 + 0x11b = 465
after:
dev release
find 0xd6 + 0x1b4 = 650 0xd2 + 0x1f5 = 711
references_duration 0x13 = 19 0x13 = 19
references_user_type 0x1a = 26 0x21 = 33
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
They are somewhat expensive (in code size at least) and not needed
everywhere.
Inside the getter the variables are 'const data_type&', so we can
return that. Everything still works when a copy is needed, but in code
that just wants to check a property we avoid the copy.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
During CQL request processing, a gate is used to ensure that
the connection is not shut down until all ongoing requests
are done. However, the gate might have been left too early
if the database was not ready to respond immediately - which
could result in trying to respond to an already closed connection
later. This issue is solved by postponing leaving the gate
until the continuation chain that handles the request is finished.
Refs #4808
* 'cleanup_sstables' of https://github.com/asias/scylla:
sstables: Move leveled_compaction_strategy implementation to source file
sstables: Include dht/i_partitioner.hh for dht::partition_range
Since nonroot mode requires to run everything on non-privileged user,
most of setup scripts does not able to use nonroot mode.
We only provide following functions on nonroot mode:
- EC2 check
- IO setup
- Node exporter installer
- Dev mode setup
Rest of functions will be skipped on scylla_setup.
To implement nonroot mode on setup scripts, scylla_util provides
utility functions to abstract difference of directory structure between normal
installation and nonroot mode.
Since systemd unit can override parameters using drop-in unit, we don't need
mustache template for them.
Also, drop --disttype and --target options on install.sh since it does not
required anymore, introduce --sysconfdir instead for non-redhat distributions.
Since ac9b115, we switched to install.sh on Debian so we don't rely on .deb
specific packaging scripts anymore.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Now that compaction returns only after the compacted sstables are
deleted we no longer need to stop the base to force waiting
for deletes (that were previously done asynchronously)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Merged patch series by Amnon Heiman:
This patch fixes a bug that a map is held on the stack and then is used
by a future.
Instead, the map is now moved to the relevant lambda function.
Fixes#4824
Stopping services which occurs in a destructor of deferred_action
should not throw, or it will end the program with
terminate(). View builder breaks a semaphore during its shutdown,
which results in propagating a broken_semaphore exception,
which in turn results in throwing an exception during stop().get().
In order to fix that issue, semaphore exceptions are explicitly
ignored, since they're expected to appear during shutdown.
Fixes#4875
To prevent termination with SIGILL, tighten the instruction set
support checks. First, check for CLMUL too. Second, add a check in
scylla_prepare to catch the problem early.
Fixes#4921.
Scylla requires the CLMUL and SSE 4.2 instruction sets and will fail without them.
There is a check in main(), but that happens after the code is running and it may
already be too late. Add a check in scylla_prepare which runs before the main
executable.
"
It is well known that seastar applications, like Scylla, do not play
well with external processes: CPU usage from external processes may
confuse the I/O and CPU schedulers and create stalls.
We have also recently seen that memory usage from other application's
anonymous and page cache memory can bring the system to OOM.
Linux has a very good infrastructure for resource control contributed by
amazingly bright engineers in the form of cgroup controllers. This
infrastructure is exposed by SystemD in the form of slices: a
hierarchical structure to which controllers can be attached.
In true systemd way, the hierarchy is implicit in the filenames of the
slice files. a "-" symbol defines the hierarchy, so the files that this
patch presents, scylla-server and scylla-helper, essentially create a
"scylla" cgroup at the top level with "server" and "helper" children.
Later we mark the Services needed to run scylla as belonging to one
or the other through the Slice= directive.
Scylla DBAs can benefit from this setup by using the systemd-run
utility to fire ad-hoc commands.
Let's say for example that someone wants to hypothetically run a backup
and transfer files to an external object store like S3, making sure that
the amount of page cache used won't create swap pressure leading to
database timeouts.
One can then run something like:
sudo systemd-run --uid=id -u scylla --gid=id -g scylla -t --slice=scylla-helper.slice /path/to/my/magical_backup_tool
(or even better, the backup tool can itself be a systemd timer)
"
* 'slices' of https://github.com/glommer/scylla:
systemd: put scylla processes in systemd slices.
move postinst steps to an external script
"
The warning for discarded futures will only become useful, once we can
silence all present warnings and flip the flag to make it become error.
Then it will start being useful in finding new, accidental discarding of
futures.
This series silences all remaining warnings in the Scylla codebase. For
those cases where it was obvious that the future is discarded on
purpose, the author taking all necessary precaution (handling exception)
the warning was simply silenced by casting the future to void and
adding a relevant comment. Where the discarding seems to have been done
in error, I have fixed the code to not discard it. To the rest of the
sites I added a FIXME to fix the discarding.
"
* 'resolve-discarded-future-warnings/v4.2' of https://github.com/denesb/scylla:
treewide: silence discarded future warnings for questionable discards
treewide: silence discarded future warnings for legit discards
tests: silence discarded future warnings
tests/cql_query_test.cc: convert some tests to thread
This patches silences the remaining discarded future warnings, those
where it cannot be determined with reasonable confidence that this was
indeed the actual intent of the author, or that the discarding of the
future could lead to problems. For all those places a FIXME is added,
with the intent that these will be soon followed-up with an actual fix.
I deliberately haven't fixed any of these, even if the fix seems
trivial. It is too easy to overlook a bad fix mixed in with so many
mechanical changes.
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
Some tests are currently discarding futures unjustifiably, however
adding code to wait on these futures is quite inconvenient due to the
continuation style code of these tests. Convert them to run in a seastar
thread to make the fix easier.
Introduced in c96ee98.
We call update_schema_version() after features are enabled and we
recalculate the schema version. This method is not updating gossip
though. The node will still use it's database::version() to decide on
syncing, so it will not sync and stay inconsistent in gossip until the
next schema change.
We should call updatE_schema_version_and_announce() instead so that
the gossip state is also updated.
There is no actual schema inconsistency, but the joining node will
think there is and will wait indefinitely. Making a random schema
change would unbock it.
Fixes#4647.
Message-Id: <1566825684-18000-1-git-send-email-tgrabiec@scylladb.com>
* seastar afc5bbf511...20bfd61955 (18):
> reactor: closing file used to check if direct_io is supported
> future: set_coroutine(): s/state()/_state/
> tests/perf/perf_test.hh: suppress discarded future warning
> tests: rpc: fix memory leak in timeout wraparound tests
> Revert "future-util: reduce allocations and continuations in parallel_for_each"
> reactor: fix rename_priority_class() build failure in C++14 mode
> future: mark future_state_base::failed() as unlikely
> future-util: reduce allocations and continuations in parallel_for_each
> future-utils: generalize when_all_estimate_vector_capacity()
> output_stream: Add comment on sequentiality
> docs/tutorial.md: minor cleanups in first section
> core: fix a race in execution stages (Fixes#4856, fixes#4766)
> semaphore: use semaphore's clock type in with_semaphore()/get_units()
> future: fix doxygen documentation for promise<>
> sharded: fixed detecting stop method when building with clang
> reactor: fixed locking error in rename_priority_class
> Assert that append_challenged_posix_file_impl are closed.
> rpc: correctly handle huge timeouts
Merged patch series from Amnon Heiman amnon@scylladb.com
This Patch adds an implementation of the get built index API and remove a
FIXME.
The API returns a list of secondary indexes belongs to a column family
and have already been fully built.
Example:
CREATE KEYSPACE scylla_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE TABLE scylla_demo.mytableID ( uid uuid, text text, time timeuuid, PRIMARY KEY (uid, time) );
CREATE index on scylla_demo.mytableID (time);
$ curl -X GET 'http://localhost:10000/column_family/built_indexes/scylla_demo%3Amytableid'
["mytableid_time_idx"]
The sum_ratio struct is a helper struct that is used when calculating
ratio over multiple shards.
Originally it was created thinking that it may need to use future, in
practice it was never used and the future was ignore.
This patch remove the future from the implementation and reduce an
unhandle future warning from the compilation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This Patch adds an implementation of the get build index API and remove a
FIXME.
The API returns the list of the built secondary indexes belongs to a column family.
Example:
CREATE KEYSPACE scylla_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE TABLE scylla_demo.mytableID ( uid uuid, text text, time timeuuid, PRIMARY KEY (uid, time) );
CREATE index on scylla_demo.mytableID (time);
$ curl -X GET 'http://localhost:10000/column_family/built_indexes/scylla_demo%3Amytableid'
["mytableid_time_idx"]
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When a role is created through the `create role` statement, the
'is_superuser' and 'can_login' columns are set to false by default.
Likewise, `list roles`, `alter roles` and `* roles` operations
expect to find a boolean when reading the same columns.
This is not the case, though, when a user directly inserts to
`system_auth.roles` and doesn't set those columns. Even though
manually creating roles is not a desired day-to-day operation,
it is an insert just like any other and it should work.
`* roles` operations, on the other hand, are not prepared for
this deviations. If a user manually creates a role and doesn't
set boolean values to those columns, `* roles` will return all
sorts of errors. This happens because `* roles` is explicitly
expecting a boolean and casting for it.
This patch makes `* roles` more friendly by considering the
boolean variable `false` - inside `* roles` context - if the
actual value is `null`; it won't change the `null` value.
Fixes#4280
Signed-off-by: Juliana Oliveira <juliana@scylladb.com>
Message-Id: <20190816032617.61680-1-juliana@scylladb.com>
The scylla_blocktune.py has a FIXME for btrfs from 2016, which is no
longer relevant for Scylla deployments, as Red Hat dropped support for
the file system in 2017.
Message-Id: <20190823114013.31112-1-penberg@scylladb.com>
The priority class the shard reader was created with was hardcoded to be
`service::get_local_sstable_query_read_priority()`. At the time this
code was written, priority classes could not be passed to other shards,
so this method, receiving its priority class parameters from another
shard, could not use it. This is now fixed, so we can just use whatever
the caller wants us to use.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190823115111.68711-1-bdenes@scylladb.com>
Cartesian products (generated by IN restrictions) can grow very large,
even for short queries. This can overwhelm server resources.
Add limit checking for cartesian products, and configuration items for
users that are not satisfied with the default of 100 records fetched.
Fixes#4752.
Tests: unit (dev), manual test with SIGHUP.
Cartesian products (via IN restrictions) make it easy to generate huge
primary key sets with simple queries, overflowing server resources. Limit
them in the coordinator and report an exception instead of trying to
execute a query that would consume all of our memory.
A unit test is added.
We need a way to configure the cql interpreter and runtime. So far we relied
on accessing the configuration class via various backdoors, but that causes
its own problems around initialization order and testability. To avoid that,
this patch adds an empty cql_config class and propagates it from main.cc
(and from tests) to the cql interpreter via the query_options class, which is
already passed everywhere.
Later patches will fill it with contents.
This was broken since the type refactoring. It was checking the static
type, which is always abstract_type. Unfortunately we only had dtests
for this.
This can probably be optimized to avoid the double switch over kind,
but it is probably better to do the simple fix first.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190821155354.47704-1-espindola@scylladb.com>
Currently we create a regex from the LIKE pattern for every row
considered during filtering, even though the pattern is always the
same. This is wasteful, especially since we require costly
optimization in the regex compiler. Fix it by reusing the regex
whenever the pattern is unchanged since the last call.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The loop over view update handlers used a guard in order to ensure
that the object is not prematurely destroyed (thus invalidating
the iterator), but the guard itself was not in the right scope.
Fixed by replacinga 'for' loop with a 'while' loop, which moves
the iterator incrementation inside the scope in which it's still
guarded and valid.
Fixes#4866
Currently, seastar is built in seastar/build/{mode}. This means we have two build
directories: build/{mode} and seastar/build/{mode}.
This patch changes that to have only a single build directory (build/{mode}). It
does that by calling Seastar's cmake directly instead of through Seastar's
./configure.py. However, to support dpdk, if that is enabled it calls cmake
through Seastar's ./cooking.sh (similar to what Seastar's ./configure.py does).
All ./configure.py flags are translated to cmake variables, in the same way that
Seastar does.
Contains fix from Rafael to pass the flags for the correct mode.
This clarifies that "rows" are clustering rows and that there is no
information about individual collection elements.
The patch also documents some properties common to all these tables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190820171204.48739-1-espindola@scylladb.com>
The endpoint directories scanned by space_watchdog may get deleted
by the manager::drain_for().
If a deleted directory is given to a lister::scan_dir() this will end up
in an exception and as a result a space_watchdog will skip this round
and hinted handoff is going to be disabled (for all agents including MVs)
for the whole space_watchdog round.
Let's make sure this doesn't happen by serializing the scanning and deletion
using end_point_hints_manager::file_update_mutex.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
end_point_hints_manager::file_update_mutex is taken by space_watchdog
but while space_watchdog is waiting for it the corresponding
end_point_hints_manager instance may get destroyed by manager::drain_for()
or by manager::stop().
This will end up in a use-after-free event.
Let's change the end_point_hints_manager's API in a way that would prevent
such an unsafe locking:
- Introduce the with_file_update_mutex().
- Make end_point_hints_manager::file_update_mutex() method private.
Fixes#4685Fixes#4836
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
If drain_for() is running together with itself: one instance for the local
node and one for some other node, erasing of elements from the _ep_managers
map may lead to a use-after-free event.
Let's serialize drain_for() calls with a semaphore.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Our current relocation works by invoking the dynamic linker with the
executable as an argument. This confuses gdb since the kernel records
the dynamic linker as the executable, not the real executable.
Switch to install-time relocation with patchelf: when installing the
executable and libraries, all paths are known, and we can update the
path to the dynamic loader and to the dynamic libraries.
Since patchelf itself is dynamically linked, we have to relocate it
dynamically (with the old method of invoking it via the dynamic linker).
This is okay since it's a one-time operation and since we don't expect
to debug core dumps of patchelf crashes.
We lose the ability to run scylla directly from the uninstalled
tarball, but since the nonroot installer is already moving in the
direction of requiring install.sh, that is not a great loss, and
certainly the ability to debug is more important.
dh_strip barfs on some binaries which were treated with patchelf,
so exclude them from dh_strip. This doesn't lose any functionality,
since these binaries didn't have debug information to begin with
(they are already-stripped Fedora executables).
Fixes#4673.
Non-full prefix keys are currently not handled correctly as all keys
are treated as if they were full prefixes, and therefore they represent
a point in the key space. Non-full prefixes however represent a
sub-range of the key space and therefore require null extending before
they can be treated as a point.
As a quick reminder, `key` is used to trim the clustering ranges such
that they only cover positions >= then key. Thus,
`trim_clustering_row_ranges_to()` does the equivalent of intersecting
each range with (key, inf). When `key` is a prefix, this would exclude
all positions that are prefixed by key as well, which is not desired.
Fixes: #4839
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190819134950.33406-1-bdenes@scylladb.com>
"
Follow-up to #4610, where a review comment asked for test coverage on all types. Existing tests cover all the types admissible in LIKE, while this PR adds coverage for all inadmissible types.
Tests: unit (dev)
"
* 'like-nonstring' of https://github.com/dekimir/scylla:
cql_query_test: Add LIKE tests for all types
cql_query_test: Remove LIKE-nonstring-pattern case
cql_query_test: Move a testcase elsewhere in file
In b197924, we changed the shutdown process not to rely on the global
reactor-defined exit, but instead added a local variable to hold the
shutdown state. However, we did not propagate that state everywhere,
and now streaming processes are not able to abort.
Fix that by enhancing stop_signal with a sharded<abort_source> member
that can be propagated to services. Propagate it to storage_service
and thence to boot_strapper and range_streamer so that streaming
processes can be aborted.
Fixes#4674Fixes#4501
Tests: unit (dev), manual bootstrap test
"
Streamed view updates parasitized on writing io priority, which is
reserved for user writes - it's now properly bound to streaming
write priority.
Verified manually by checking appropriate io metrics: scylla_io_queue_total_bytes{class="streaming_write" ...} vs scylla_io_queue_total_bytes{class="query" ...}
Tests: unit(dev)
"
* 'assign_proper_io_priority_to_streaming_view_updates' of https://github.com/psarna/scylla:
db,view: wrap view update generation in stream scheduling group
database: assign proper io priority for streaming view updates
Our current relocation works by invoking the dynamic linker with the
executable as an argument. This confuses gdb since the kernel records
the dynamic linker as the executable, not the real executable.
Switch to install-time relocation with patchelf: when installing the
executable and libraries, all paths are known, and we can update the
path to the dynamic loader and to the dynamic libraries.
Since patchelf itself is dynamically linked, we have to relocate it
dynamically (with the old method of invoking it via the dynamic linker).
This is okay since it's a one-time operation and since we don't expect
to debug core dumps of patchelf crashes.
We lose the ability to run scylla directly from the uninstalled
tarball, but since the nonroot installer is already moving in the
direction of requiring install.sh, that is not a great loss, and
certainly the ability to debug is more important.
dh_strip barfs on some binaries which were treated with patchelf,
so exclude them from dh_strip. This doesn't lose any functionality,
since these binaries didn't have debug information to begin with
(they are already-stripped Fedora executables).
Fixes#4673.
It is well known that seastar applications, like Scylla, do not play
well with external processes: CPU usage from external processes may
confuse the I/O and CPU schedulers and create stalls.
We have also recently seen that memory usage from other application's
anonymous and page cache memory can bring the system to OOM.
Linux has a very good infrastructure for resource control contributed by
amazingly bright engineers in the form of cgroup controllers. This
infrastructure is exposed by SystemD in the form of slices: a
hierarchical structure to which controllers can be attached.
In true systemd way, the hierarchy is implicit in the filenames of the
slice files. a "-" symbol defines the hierarchy, so the files that this
patch presents, scylla-server and scylla-helper, essentially create a
"scylla" cgroup at the top level with "server" and "helper" children.
Later we mark the Services needed to run scylla as belonging to one
or the other through the Slice= directive.
Scylla DBAs can benefit from this setup by using the systemd-run
utility to fire ad-hoc commands.
Let's say for example that someone wants to hypothetically run a backup
and transfer files to an external object store like S3, making sure that
the amount of page cache used won't create swap pressure leading to
database timeouts.
One can then run something like:
```
sudo systemd-run --uid=`id -u scylla` --gid=`id -g scylla` -t --slice=scylla-helper.slice /path/to/my/magical_backup_tool
```
(or even better, the backup tool can itself be a systemd timer)
Changes from last version:
- No longer use the CPUQuota
- Minor typo fixes
- postinstall fixup for small machines
Benchmark results:
==================
Test: read from disk, with 100% disk util using a single i3.xlarge (4 vCPUs).
We have to fill the cache as we read, so this should stress CPU, memory and
disk I/O.
cassandra-stress command:
```
cassandra-stress read no-warmup duration=5m -rate threads=20 -node 10.2.209.188 -pop dist=uniform\(1..150000000\)
```
Baseline results:
```
Results:
Op rate : 13,830 op/s [READ: 13,830 op/s]
Partition rate : 13,830 pk/s [READ: 13,830 pk/s]
Row rate : 13,830 row/s [READ: 13,830 row/s]
Latency mean : 1.4 ms [READ: 1.4 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile : 2.8 ms [READ: 2.8 ms]
Latency 99.9th percentile : 3.4 ms [READ: 3.4 ms]
Latency max : 12.0 ms [READ: 12.0 ms]
Total partitions : 4,149,130 [READ: 4,149,130]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
Question 1:
===========
Does putting scylla in a special slice affect its performance ?
Results with Scylla running in a slice:
```
Results:
Op rate : 13,811 op/s [READ: 13,811 op/s]
Partition rate : 13,811 pk/s [READ: 13,811 pk/s]
Row rate : 13,811 row/s [READ: 13,811 row/s]
Latency mean : 1.4 ms [READ: 1.4 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.2 ms [READ: 2.2 ms]
Latency 99th percentile : 2.6 ms [READ: 2.6 ms]
Latency 99.9th percentile : 3.3 ms [READ: 3.3 ms]
Latency max : 23.2 ms [READ: 23.2 ms]
Total partitions : 4,151,409 [READ: 4,151,409]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
*Conclusion* : No significant change
Question 2:
===========
What happens when there is a CPU hog running in the same server as scylla?
CPU hog:
```
taskset -c 0 /bin/sh -c "while true; do true; done" &
taskset -c 1 /bin/sh -c "while true; do true; done" &
taskset -c 2 /bin/sh -c "while true; do true; done" &
taskset -c 3 /bin/sh -c "while true; do true; done" &
sleep 330
```
Scenario 1: CPU hog runs freely:
```
Results:
Op rate : 2,939 op/s [READ: 2,939 op/s]
Partition rate : 2,939 pk/s [READ: 2,939 pk/s]
Row rate : 2,939 row/s [READ: 2,939 row/s]
Latency mean : 6.8 ms [READ: 6.8 ms]
Latency median : 5.3 ms [READ: 5.3 ms]
Latency 95th percentile : 11.0 ms [READ: 11.0 ms]
Latency 99th percentile : 14.9 ms [READ: 14.9 ms]
Latency 99.9th percentile : 17.1 ms [READ: 17.1 ms]
Latency max : 26.3 ms [READ: 26.3 ms]
Total partitions : 884,460 [READ: 884,460]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
Scenario 2: CPU hog runs inside scylla-helper slice
```
Results:
Op rate : 13,527 op/s [READ: 13,527 op/s]
Partition rate : 13,527 pk/s [READ: 13,527 pk/s]
Row rate : 13,527 row/s [READ: 13,527 row/s]
Latency mean : 1.5 ms [READ: 1.5 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile : 2.9 ms [READ: 2.9 ms]
Latency 99.9th percentile : 3.8 ms [READ: 3.8 ms]
Latency max : 18.7 ms [READ: 18.7 ms]
Total partitions : 4,069,934 [READ: 4,069,934]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
*Conclusion*: With systemd slice we can keep the performance very close to
baseline
Question 3:
===========
What happens when there is a CPU hog running in the same server as scylla?
I/O hog: (Data in the cluster is 2x size of memory)
```
while true; do
find /var/lib/scylla/data -type f -exec grep glauber {} +
done
```
Scenario 1: I/O hog runs freely:
```
Results:
Op rate : 7,680 op/s [READ: 7,680 op/s]
Partition rate : 7,680 pk/s [READ: 7,680 pk/s]
Row rate : 7,680 row/s [READ: 7,680 row/s]
Latency mean : 2.6 ms [READ: 2.6 ms]
Latency median : 1.3 ms [READ: 1.3 ms]
Latency 95th percentile : 7.8 ms [READ: 7.8 ms]
Latency 99th percentile : 10.9 ms [READ: 10.9 ms]
Latency 99.9th percentile : 16.9 ms [READ: 16.9 ms]
Latency max : 40.8 ms [READ: 40.8 ms]
Total partitions : 2,306,723 [READ: 2,306,723]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
Scenario 2: I/O hog runs in the scylla-helper systemd slice:
```
Results:
Op rate : 13,277 op/s [READ: 13,277 op/s]
Partition rate : 13,277 pk/s [READ: 13,277 pk/s]
Row rate : 13,277 row/s [READ: 13,277 row/s]
Latency mean : 1.5 ms [READ: 1.5 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile : 2.9 ms [READ: 2.9 ms]
Latency 99.9th percentile : 3.5 ms [READ: 3.5 ms]
Latency max : 183.4 ms [READ: 183.4 ms]
Total partitions : 3,984,080 [READ: 3,984,080]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
*Conclusion*: With systemd slice we can keep the performance very close to
baseline
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Propagate the abort_source from main() into boot_strapper and range_stream and
check for aborts at strategic points. This includes aborting running stream_plans
and aborting sleeps between retries.
Fixes#4674
In order to propagate stop signals, expose them as sharded<abort_source>. This
allows propagating the signal to all shards, and integrating it with
sleep_abortable().
Because sharded<abort_source>::stop() will block, we'll now require stop_signal
to run in a thread (which is already the case).
"
This is hopefully the last large refactoring on the way of UDF.
In UDF we have to convert internal types to Lua and back. Currently
almost all our types and hidden in types.cc and expose functionality
via virtual functions. While it should be possible to add a
convert_{to|from}_lua virtual functions, that seems like a bad design.
In compilers, the type definition is normally public and different
passes know how to reason about each type. The alias analysis knows
about int and floats, not the other way around.
This patch series is inspired by both the LLVM RTTI
(https://www.llvm.org/docs/HowToSetUpLLVMStyleRTTI.html) and
std::variant.
The series makes the types public, adds a visit function and converts
the various virtual methods to just use visit. As a small example of
why this is useful, it then moves a bit of cql3 and json specific
logic out of types.cc and types.hh. In a similar way, the UDF code
will be able to used visit to convert objects to Lua.
In comparison with the previous versions, this series doesn't require the intermediate step of converting void* to data_value& in a few member functions.
This version also has fewer double dispatches I a am fairly confident has all the tools for avoiding all double dispatches.
"
* 'simplify-types-v3' of https://github.com/espindola/scylla: (80 commits)
types: Move abstract_type visit to a header
types: Move uuid_type_impl to a header
types: Move inet_addr_type_impl to a header
types: Move varint_type_impl to a header
types: Move timeuuid_type_impl to a header
types: Move date_type_impl to a header
types: Move bytes_type_impl to a header
types: Move utf8_type_impl to a header
types: Move ascii_type_impl to a header
types: Move string_type_impl to a header
types: Move time_type_impl to a header
types: Move simple_date_type_impl to a header
types: Move timestamp_type_impl to a header
types: Move duration_type_impl to a header
types: Move decimal_type_impl to a header
types: Move floating point types to a header
types: Move boolean_type_impl to a header
types: Move integer types to a header
types: Move integer_type_impl to a header
types: Move simple_type_impl to a header
...
This testcase was previously commented out, pending a fix that cannot
be made. Currently it is impossible to validate the marker-value type
at filtering time. The value is entered into the options object under
its presumed type of string, regardless of what it was made from.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Somehow this test case sits in the middle of LIKE-operator tests:
test_alter_type_on_compact_storage_with_no_regular_columns_does_not_crash
Move it so LIKE test cases are contiguous.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
There are systemd-related steps done in both rpm and deb builds.
Move that to a script so we avoid duplication.
The tests are so far a bit specific to the distributions, so it
needs to be adapted a bit.
Also note that this also fixes a bug with rpm as a side-effect:
rpm does not call daemon-reload after potentially changing the
systemd files (it is only implied during postun operations, that
happen during uninstall). daemon-reload was called explicitly for
debian packages, and now it is called for both.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The following logic was duplicated:
* For all types, if value is null, the result is zero.
* For non collection types, if the native object is empty, the result
is zero.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We defined less for some types and compare for others. There is no
type for which compare is substantially more expensive, so define it
for all types and implement less with compare.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The type walking is similar to what the find function does, but
refactoring it doesn't seem worth it if these are the only two uses.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch the logic for walking all nested types is moved to a
helper function. It also fixes reversed_type_impl not being handled in
references_duration.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The api is inspired by on std::variant.
This bridges the runtime type of a abstract_type object to a compile
time overload resolution. For example, it is possible to have a single
lambda to visit a string_type_impl, but it corresponds to two leaf
types (ascii and utf8).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The type hierarchy is closed, so we can give each leaf an enum value.
This will be used to implement a visitor pattern and reduce code
duplication.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The corresponding code is correct, but I noticed no tests would fail
if it was broken while refactoring it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These only use public member functions from data_value, so there is no
reason for not making them public too.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There are two reasons for the move. First is that cql server lifetime
is shorter than storage_proxy one and the later stores references to
the semaphore in each service_permit it holds. Second - we want thrift
(and in the future other user APIs) to share the same admission control
memory pool.
Fixes#4844
Message-Id: <20190814142614.GT17984@scylladb.com>
We in fact do stop cql server in storage_service::drain_on_shutdown()
which is called in main.cc during shutdown.
Message-Id: <20190814085027.GP17984@scylladb.com>
Since we do accept pull requests (in a long-running experiment), the
pull request template suggesting not to use them is inaccurate, and
many requesters forget to remove the boilerplace.
Remove the outdate template.
"Commit e3f7fe4 added file owner validation to prevent Scylla from
crashing when it tries to touch a file it doesn't own. However, under
docker, we cannot expect to pass this check since user IDs are from
different namespaces: the process runs in a container namespace, but the
data files usually come from a mounted volume, and so their uids are
from the host namespace.
So we need to relax the check. We do this by reverting b1226fb, which
causes Scylla to run as euid 0 in docker, and by special-casing euid 0
in the ownership verification step.
Fixes #4823."
* 'docker-euid-0' of git://github.com/avikivity/scylla:
main: relax file ownership checks if running under euid 0
Revert "dist/docker/redhat: change user of scylla services to 'scylla'"
During startup, we check that the data files are owned by our euid.
But in a container environment, this is impossible to enforce because
uid/username mappings are different between the host and the container,
and the data files are likely to be mounted from the host.
To allow for such environments, relax the checks if euid=0. This
both matches what happens in a container (programs run as root) and
the kernel access checks (euid 0 can do anything).
We can reconsider this when container uid mapping is better developed.
Fixes#4823.
Fixes#4536.
This reverts commit b1226fb15a. When the
data volume is mounted from the host (as is usual in container
deployments), we can't expect that the files will be owned by the
in-container scylla user. So that commit didn't really fix#4536.
A follow-up patch will relax the check so it passes in a container
environment.
This adds a '--configure-flags FLAGS' command line option, which
overrides the flags passed to scylla.git 'configure.py' script. We need
this for flexibility of custom builds in Jenkins pipelines, for example.
Message-Id: <20190813095428.13590-1-penberg@scylladb.com>
Make the reader recreation logic more robust, by moving away from
deciding which fragments have to be dropped based on a bunch of
special cases, instead replacing this with a general logic which just
drops all already seen fragments (based on their position). Special
handling is added for the case when the last position is a range
tombstone with a non full prefix starting position. Reproducer unit
tests are added for both cases.
Refs #4695Fixes#4733
We should wait for a future returned from add_local_application_state() to
resolve before issuing new calculation, otherwise two
add_local_application_state() may run simultaneously for the same state.
Fixes#4838.
Message-Id: <20190812082158.GE17984@scylladb.com>
Add `single_fragment_buffer` test variable. When set, the shard readers
are created with a max buffer size of 1, effectively causing them to
read a single fragment at a time. This, when combined with
`evict_readers=true` will stress the recreate reader logic to the max.
Be more tolerant with different but equivalent representation of range
deletions. When expecting a range tombstone, keep reading range
tombstones while these can be merged with the cumulative range
tombstone, resulting from the merging of the previous range tombstones.
This results in tests tolerating range tombstones that are split into
several, potentially overlapping range tombstones, representing the
same underlying deletion.
This is tailored to the multishard_combining_reader, to make sure it
doesn't loos rows following a range tombstone with a prefix starting
position (whose prefix their keys fall into).
Currently the remote reader uses the last seen fragment's position to
calculate the position the reader should continue from when the reader
is recreated after having been evicted. Recently it was discovered that
this logic breaks down badly when this last position is a non-full
clustering prefix (a range tombstone start bound). In this case, if only
the last position is available, there is no good way of computing the
starting position. Starting after this position will potentially miss
any rows that fall into the prefix (the current behaviour). Starting
from before it will cause all range tombstones with said prefix to be
re-emitted, causing other problems. A better solution is to exploit the
fact that sometimes we also know what the next fragment is.
These "some" times are the exact times that are problematic with the
current approach -- when the last fragment is a range tombstone.
Exploiting this extra knowledge allows for a much better way for
calculating the starting position: instead of maintaining the last
position, we maintain the next position, which is always safe to start
from. This is not always possible, but in many cases we can know for
sure what the next position is, for example if the last position was a
static row we can be sure the next position is the first clustering
position (or partition end). In the few cases where we cannot calculate
the next position we fall back to the previous logic and start from
*after* the last positions. The good news is that in these remaining
cases (the last fragment is a clustering row) it is safe to do so.
This patch also does some refactoring of the remote-reader internals,
all fill-buffer related logic is grouped together in a single
`fill_buffer()` method.
Allow expressing `pos` in term of a `position_in_partition_view`, which
allows finer control of the exact position, allowing specifying position
before, at or after a certain key.
The previous overload is kept for backward compatibility, invoking the
new overload behind the curtains.
"
Refs #4726
Implement the api portion of a "describe sstables" command.
Adds rest types for collecting both fixed and dynamic attributes, some grouped. Allows extensions to add attributes as well. (Hint hint)
"
* 'sstabledesc' of https://github.com/elcallio/scylla:
api/storage_service: Add "sstable_info" command
sstables/compress: Make compressor pointer accessible from compression info
sstables.hh: Add attribute description API to file extension
sstables.hh: Add compression component accessor
sstables.hh: Make "has_component" public
Match SQL's LIKE in allowing an empty pattern, which matches only
an empty text field.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This simplifies the debug implementation and it now should work with
scylla-gdb.py.
It is not clear what, if anything, is lost by not using random
ids. They were never being reused in the debug implementation anyway.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190618144755.31212-1-espindola@scylladb.com>
This patch fixes a bug that a map is held on the stack and then is used
by a future.
Instead, the map is now wrapped with do_with.
Fixes#4824
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
After commit 8a0c4d5 (Merge "Repair switch to rpc stream" from Asias),
we increased the row buffer size for repair from 512KiB to 32MiB per
repair instance. We allow repairing 16 ranges (16 repair instance) in
parallel per repair request. So, a node can consume 16 * 32MiB = 512MiB
per user requested repair. In addition, the repair master node can hold
data from all the repair followers, so the memory usage on repair master
can be larger than 512MiB. We need to provide a way to limit the memory
usage.
In this patch, we limit the total memory used by repair to 10% of the
shard memory. The ranges that can be repaired in parallel is:
max_repair_ranges_in_parallel = max_repair_memory / max_repair_memory_per_range.
For example, if each shard has 4096MiB of memory, then we will have
max_repair_ranges_in_parallel = 4096MiB / 32MiB = 12.
Fixes#4675
"
Current admission control takes a permit when cql requests starts and
releases it when reply is sent, but some requests may leave background
work behind after that point (some because there is genuine background
work to do like complete a write or do a read repair, and some because
a read/write may stuck in a queue longer than the request's timeout), so
after Scylla replies with a timeout some resources are still occupied.
The series fixes this by passing the permit down to storage_proxy where
it is held until all background work is completed.
Fixes#4768
"
* 'gleb/admission-v3' of github.com:scylladb/seastar-dev:
transport: add a metric to follow memory available for service permit.
storage_proxy: store a permit in a read executor
storage_proxy: store a permit in a write response handler
Pass service permit to storage_proxy
transport: introduce service_permit class and use it instead of semaphore_units
transport: hold admission a permit until a reply is sent
transport: remove cql server load balancer
A read executor exists until read operation completes in its entirety
so storing a permit there guaranties that it will be freed only after
no background work left for the request on this server.
A write response handler exists until write operation completes in its
entirety so storing a permit there guaranties that it will be freed only
after no background work left for the request on this server.
Current cql transport code acquire a permit before processing a query and
release it when the query gets a reply, but some quires leave work behind.
If the work is allowed to accumulate without any limit a server may
eventually run out of memory. To prevent that the permit system should
account for the background work as well. The patch is a first step in
this direction. It passes a permit down to storage proxy where it will
be later hold by background work.
Manager trims sstables off to allow compaction jobs to proceed in parallel
according to their weights. The problem is that trimming procedure is not
sstable run aware, so it could incorrectly remove only a subset of a sstable
run, leading to partial sstable run compaction.
Compaction of a sstable run could lead to inneficiency because the run structure
would be messed up, affecting all amplification factors, and the same generation
could even end up being compacted twice.
This is fixed by making the trim procedure respect the sstable runs.
Fixes#4773.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190730042023.11351-1-raphaelsc@scylladb.com>
Current code release admission permit to soon. If new requests are
admitted faster than client read replies back reply queue can grow to
be very big. The patch moves service permit release until after a reply
is sent.
Merged patch series from Avi Kivity:
In rare but valid cases (reconciling many tombstones, paging disabled),
a reconciled_result can grow large. This triggers large allocation
warnings. Switch to chunked_vector to avoid the large allocation.
In passing, fix chunked_vector's begin()/end() const correctness, and
add the reverse iterator function family which is needed by the conversion.
Fixes#4780.
Tests: unit (dev)
Commit Summary
utils: chunked_vector: make begin()/end() const correct
utils::chunked_vector: add rbegin() and related iterators
reconcilable_result: use chunked_vector to hold partitions
"
Add listen/rpc "prefer_ipv6" options to DNS lookup of bind addresses for API/rpc/prometheus etc .
Fixes#4751
Adds using a preferred address family to dns name lookups related to
listen address and rpc address, adhering to the respective "prefer" options.
API, prometheus and broadcast address are all considered to be covered by
the "listen_interface_prefer_ipv6" option.
Note: scylla does not yet support actual interface binding, but these
options should apply equally to address name parameters.
Setting a "prefer_ipv6" option automtially enables ipv6 dns family query.
"
* 'calle/ipv6' of https://github.com/elcallio/scylla:
init: Use the "prefer_ipv6" options available for rpc/listen address/interface
inet_address: Add optional "preferred type" to lookup
config: Add rpc_interface_prefer_ipv6 parameter
config: Add listen_interface_perfer_ipv6 parameter
config.cc: Fix enable_ipv6_dns_lookup actual param name
The FIXME was added in the very first commit ("utils: Convert
utils/FBUtilities.java") that introduced the fb_utilities class as a
stub. However, we have long implemented the parts that we actually use,
so drop the FIXME as obsolete. In addition, drop the remaining
uncommented Java code as unused and also obsolete.
Message-Id: <20190808182758.1155-1-penberg@scylladb.com>
The Maven build tool ("mvn"), which is used by scylla-jmx and
scylla-tools-java, stores dependencies in a local repository stored at
$HOME/.m2. Make sure it's accessible to dbuild.
Message-Id: <20190808140216.26141-1-penberg@scylladb.com>
ignore_partial_runs() brings confusion because i__p__r() equal to true
doesn't mean filter out partial runs from compaction. It actually means
not caring about compaction of a partial run.
The logic was wrong because any compaction strategy that chooses not to ignore
partial sstable run[1] would have any fragment composing it incorrectly
becoming a candidate for compaction.
This problem could make compaction include only a subset of fragments composing
the partial run or even make the same fragment be compacted twice due to
parallel compaction.
[1]: partial sstable run is a sstable that is still being generated by
compaction and as a result cannot be selected as candidate whatsoever.
Fix is about making sure partial sstable run has none of its fragments
selected for compaction. And also renaming i__p__r.
Fixes#4729.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190807022814.12567-1-raphaelsc@scylladb.com>
This documents the steps needed to build Scylla's Linux packages with
the relocatable package infrastructure we use today.
Message-Id: <20190807134017.4275-1-penberg@scylladb.com>
Running "dbuild" without a build command fails as follows:
$ ./tools/toolchain/dbuild
Error: This command has to be run under the root user.
Israel Fruchter discovered that the default command of our Docker image is this:
"Cmd": [
"bash",
"-c",
"dnf -y install python3-cassandra-driver && dnf clean all"
]
Let's make "/bin/bash" the default command instead, which will make
"dbuild" with no build command to return to the host shell.
Message-Id: <20190807133955.4202-1-penberg@scylladb.com>
The build_reloc.sh script passes "--with=scylla" and "--with=iotune" to
the configure.py script. This is redundant as the
"scylla-package.tar.gz" target of ninja already limits itself to them.
Removing the "--with" options allows building unit tests after a
relocatable package has been built without having to rebuild anything.
Message-Id: <20190807130505.30089-1-penberg@scylladb.com>
Add a '--mode <mode>' command line option to the 'build_reloc.sh' script
so that we can create relocatable packages for debug builds.
The '--mode' command line option defaults to 'release' so existing users
are unaffected.
Message-Id: <20190807120759.32634-1-penberg@scylladb.com>
Avoid including the lengthy stream_session.hh in messaging_service.
More importantly, fix the build because currently messaging_service.cc
and messaging_service.hh does not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when backporting
the "streaming: Send error code from the sender to receiver" to 3.0
branch.
Refs: #4789
The stream close() guarantees the data sent will be flushed. No need to
call the stream flush() since the stream is not reused.
Follow up fix for commit bac987e32a (streaming: Send error code from
the sender to receiver).
Refs #4789
* seastar d199d27681...a1cf07858b (1):
> Merge 'Do not return a variadic future form server_socket::accept()' from Avi
Seastar configure.py now has --api-level=1, to keep us one the old variadic future
server_socket::accept() API.
In case of error on the sender side, the sender does not propagate the
error to the receiver. The sender will close the stream. As a result,
the receiver will get nullopt from the source in
get_next_mutation_fragment and pass mutation_fragment_opt with no value
to the generating_reader. In turn, the generating_reader generates end
of stream. However, the last element that the generating_reader has
generated can be any type of mutation_fragment. This makes the sstable
that consumes the generating_reader violates the mutation_fragment
stream rule.
To fix, we need to propagate the error. However RPC streaming does not
support propagate the error in the framework. User has to send an error
code explicitly.
Fixes: #4789
Fixes#4751
Adds using a preferred address family to dns name lookups related to
listen address and rpc address, adhering to the respective "prefer" options.
API, prometheus and broadcast address are all considered to be covered by
the "listen_interface_prefer_ipv6" option.
Note: scylla does not yet support actual interface binding, but these
options should apply equally to address name parameters.
Setting a "prefer_ipv6" option automtially enables ipv6 dns family query.
Assembles information and attributes of sstables in one or more
column families.
v2:
* Use (not really legal) nested "type" in json
* Rename "table" param to "cf" for consistency
* Some comments on data sizes
* Stream result to avoid huge string allocations on final json
"
This adds the option to compress sstables using the Zstandard algorithm
(https://facebook.github.io/zstd/).
To use, pass 'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor'
to the 'compression' argument when creating a table.
You can also specify a 'compression_level' (default is 3). See Zstd documentation for the available
compression levels.
Resolves#2613.
This PR also fixes a bug in sstables/compress.cc, where chunk length in bytes
was passed to the compressor as chunk length in kilobytes. Fortunately,
none of the compressors implemented until now used this parameter.
Example usage (assuming there exists a keyspace 'a'):
create table a.a (a text primary key, b int) with compression = {'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor', 'compression_level': 1, 'chunk_length_in_kb': '64'};
Notes:
1. The code uses an external dependency: https://github.com/facebook/zstd. Since I'm using "experimental" features of the library (using my own allocated memory to store the compression/decompression contexts), according to the library's documentation we need to link it statically (https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L63). I added a git submodule.
2. The compressor performs some dynamic allocations. Depending on the specified chunk length and/or compression level the allocations might be big and seastar throws warnings. But with reasonable chunk length sizes it should be OK.
3. It doesn't yet provide an option to train it with dictionaries, but that should be easy to add in another commit.
"
* 'zstd' of https://github.com/kbr-/scylla:
Configure: rename seastar_pool to submodule_pool, add more submodules to the pool
Add unit tests for Zstd compression
Enable tests that use compressed sstable files
Add ZStandard compression
Fix the value of the chunk length parameter passed to compressors
The fragment reader calls `fast_forward_to()` from its constructor to
discard fragments that fall outside the query range. Mmove the
the fast-forward code in to an internal void returning method, and call
that from both the constructor and `fast_forward_to()`, to avoid a
warning on a discarded future<>.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190801133942.10744-1-bdenes@scylladb.com>
The files in tests/sstables/3.x/compressed/ were not used in the tests.
This commit:
- renames the directory to tests/sstables/3.x/lz4/,
- adds analogous directories and files for other compressors,
- adds tests using these files,
- does some minor refactoring.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
This adds the option to compress sstables using the Zstandard algorithm
(https://facebook.github.io/zstd/).
To use, pass 'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor'
to the 'compression' argument when creating a table.
You can also specify a 'compression_level'. See Zstd documentation for the available
compression levels.
Resolves#2613.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
This commit also fixes a bug in sstables/compress.cc, where chunk length in bytes
was passed to the compressor as chunk length in kilobytes. Fortunately,
none of the compressors implemented until now used this parameter.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
"
Not emitting partition_end for a partition is incorrect. SStable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.
Enabled for both mc and la/ka.
Passing --abort-on-internal-error on the command line will switch to
aborting instead of throwing an exception.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
"
* 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla:
sstables: writer: Validate that partition is closed when the input mutation stream ends
config, exceptions: Add helper for handling internal errors
utils: config_file: Introduce named_value::observe()
When a node is restarted, there is a race between gossip starts (other
nodes will mark this node up again and send requests) and the tokens are
replicated to other shards. Here is an example:
- n1, n2
- n2 is down, n1 think n2 is down
- n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
reads/writes to n2, but n2 hasn't replicated the token_metadata to all
the shards.
- n2 complains:
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
(sorted_tokens is empty in first_token_index!)
The code path looks like below:
0 stoarge_service::init_server
1 prepare_to_join()
2 add gossip application state of NET_VERSION, SCHEMA and so on.
3 _gossiper.start_gossiping().get()
4 join_token_ring()
5 _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6 replicate_to_all_cores().get()
7 storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS
The race talked above is at line 3 and line 6.
To fix, we can replicate the token_metadata early after it is filled
with the tokens read from system table before gossip starts. So that
when other nodes think this restarting node is up, the tokens are
already replicated to all the shards.
In addition, this patch also fixes the issue that other nodes might see
a node miss the TOKENS and STATUS application state in gossip if that
node failed in the middle of a restarting process, i.e., it is killed
after line 3 and before line 7. As a result we could not replace the
node.
Tests: update_cluster_layout_tests.py
Fixes: #4709Fixes: #4723
"
Altough 733c68cb1 made sure to synchronize the query time used for
compaction happening in the mutation_source_test suite and that
happening in the `flat_mutation_assertions` class, there remained
another hidden compaction that potentially could use a different
timestamp and hence produce false positive test failures. This was
hastily fixed by cea3338e3, by just increasing the TTL of cells, thus
avoiding possible differences in compaction output. This mini-series is
the proper fix to this problem. It passes a query time to the populate
function, allowing the users of the mutation source test suite to
forward it to any compaction they might be doing on the data. The quick
fix is reverted in favor of the proper fix.
Refs: #4747
"
* 'mutation_source_tests_proper_ttl_fix/v1' of https://github.com/denesb/scylla:
Revert "tests/mutation_source_tests: generate_mutation_sets() use larger ttl"
tests/sstable_mutation_test: test_sstable_conforms_to_mutation_source: use query_time
tests/mutation_source_test: add populate_fn overload with query_time
Not emitting partition_end for a partition is incorrect. Sstable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will may result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.
Enabled for both mc and la/ka.
The handler is intended to be called when internal invariants are
violated and the operation cannot safely continue. The handler either
throws (default) or aborts, depending on configuration option.
Passing --abort-on-internal-error on the command line will switch to
aborting.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
Usually, a reconcilable_result holds very few partitions (1 is common),
since the page size is limited by 1MB. But if we have paging disabled or
if we are reconciling a range full of tombstones, we may see many more.
This can cause large allocations.
Change to chunked_vector to prevent those large allocations, as they
can be quite expensive.
Fixes#4780.
begin() of a const vector should return a const_iterator, to avoid
giving the caller the ability to mutate it.
This slipped through since iterator's constructor does a const_cast.
Noticed by code inspection.
Use the query_time passed in to the populate function and forward it to
the sstable constructor, so that the compaction happening during sstable
write uses the same query time that any compaction done by the mutation
source test suit does.
So tests that do compaction can pass the query_time they used for it to
clients that do some compaction themselves, making sure all compactions
happen with the same query time, avoiding false positive test failures.
Don't let perftune.py print false alarm error message when we calculate
a compute CPU set for tuning modes.
This may happen when we calculate a CPU set for non-MQ tuning modes on
small systems on which these modes are forbidden because they would
result in a zero CPU set, e.g. sq_split on a system with a single
physical core.
We are going to utilize a newly introduced --get-cpu-mask-quiet execution
mode introduced to the seastar/script/perftune.py by the
"perftune.py: introduce --get-cpu-mask-quiet" series which would return
a zero CPU set if that's what it turns out to be instead of exiting with
an error what --get-cpu-mask would do in such a case.
The rest of scylla_util.py logic is going to handle a zero CPU set
returned by get_mode_cpuset() correctly.
Fixes#4211Fixes#4443
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190731212901.9510-1-vladz@scylladb.com>
"
A fix for #4338 "storage_proxy add a counter for cql requests
that arrived to a non replica"
Such requests should be tracked since forwarding them to a correct
replica can create a lot network noise and incur significant performance
penalty.
The current metrics are considered insufficient after introduction
of heat-weighted load balancing.
"
Fixes#4388.
* 'gh-4338' of https://github.com/kostja/scylla:
metrics: introduce a metric for non-local reads
metrics: account writes forwarded by a coordinator in an own metric.
"
Fixes#4663.
Fixes#4718.
"
* 'make_cleanup_run_aware_v3' of https://github.com/raphaelsc/scylla:
tests/sstable_datafile_test: Check cleaned sstable is generated with expected run id
table: Make SSTable cleanup run aware
compaction: introduce constants for compaction descriptor
compaction: Make it possible to config the identifier of the output sstable run
table: do not rely on undefined behavior in cleanup_sstables
* seastar 3f88e9068b...d90834443c (12):
> Print warning when somaxconn lower than backlog parameter used for listen()
> Merge "perftune.py: introduce --get-cpu-mask-quiet" from Vlad
> seastar-json2code: Handle "$ref"-usage for nested object types properly
> Make future [[nodiscard]]
> Allow pass listen_options to http_server::listen
> Handle EPOLLHUP and EPOLLERR from epoll explicitly
> reactor: fix false positives in the stall detector due to large task queue
> Merge "Small asan related improvements" from Rafael
> thread: reduce allocations during context switch
> thread: remove deprecated thread_scheduling_group and its unit test
> reactor: make _polls to be non atomic
> reactor: remove unused _tasks_processed variable
On previous commit ac9b115a8f, install.sh requires to specify single package using --pkg, there is no way to select all.
It should be select all packages when running install.sh without --pkg.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190731013245.5857-1-syuu@scylladb.com>
get_rpc_client assumes the messaging_service is not stopped. We should check
is_stopping() before we call get_rpc_client.
We do such check in existing code, e.g., send_message and friends. Do
the same check in the newly introduced
make_sink_and_source_for_stream_mutation_fragments() and friends for row
level repair.
Fixes: #4767
"
This series is a batch of first small steps towards devirtualizing CQL restrictions:
- one artificial parent class in the hierarchy is removed: abstract_restriction
- the following functions are devirtualized:
* is_EQ()
* is_IN()
* is_slice()
* is_contains()
* is_LIKE()
* is_on_token()
* is_multi_column()
Future steps can involve the following:
- introducing a std::variant of restriction targets: it's either a column
or a vector of columns
- introducing a std::variant of restriction values: it's one of:
{term, term_slice, std::vector<term>, abstract_marker}
The steps above will allow devirtualizing most of the remaining virtual functions
in favor of std::visit. They will also reduce the number of subclasses,
e.g. what's currently `token_restriction::IN_with_values` can be just an instance
of `restriction`, knowing that it's on a token, having a target of std::vector<column>
and a value of std::vector<term>.
Tests: unit(dev), dtest: cql_tests, cql_additional_tests
"
* 'refactor_restrictions_2' of https://github.com/psarna/scylla:
cql3: devirtualize is_on_token()
cql3: devirtualize is_multi_column()
cql3: devirtualize is_EQ, is_IN, is_contains, is_slice, is_LIKE
tests: add enum_set adding case
cql3: allow adding enum_sets
cql3: remove abstract_restriction class
(At least) on Ubuntu 19 'thrift -version' prints the expected
string but its exit status is non-zero:
$ thrift -version
Thrift version 0.9.1
$ echo $?
1
We don't really care about the exit status but rather about the printed
version string. If there is going to be some problem with the command,
e.g. it's missing, the printed string is not going to be as expected
anyway - let's verify that explicitly by checking the format of the
returned string in that case.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190722211729.24225-1-vladz@scylladb.com>
Currently all cells generated by this method uses a ttl of 1. This
causes test flakyness as tests often compact the input and output
mutations to weed out artificial differences between them. If this
compaction is not done with the exact same query time, then some cells
will be expired in one compaction but not in the other.
733c68cb1 attempted to solve this by passing the same query time to
`flat_mutation_reader_assertions::produce_compacted()` as well as
`mutation_partition::compact_for_query()` when compacting the input
mutation. However a hidden compaction spot remained: the ka/la sstable
writer also does some compaction, and for this it uses the time point
passed to the `sstable` constructor, which defaults to
`gc_clock::now()`. This leads to false positive failures in
`sstable_mutation_test.cc`.
At this point I don't know what the original intent was behind this low
`ttl` value. To solve the immediate problem of the tests failing, I
increased it. If it turns out that this `ttl` value has a good reason,
we can do a more involved fix, of making sure all sstables written also
get the same query time as that used for the compaction.
Fixes: #4747
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190731081522.22915-1-bdenes@scylladb.com>
All restrictions inherit from `abstract_restriction` class,
which has only one parent class: `restriction`. To simplify the
inheritance tree, `restriction` and `abstract_restriction`
are merged into one class named `restriction`.
`produces_compacted()` is usually used in tandem of another
compaction done on the expected output (`m` param). This is usually done
so that even though the reader works with an uncompacted stream, when
checking the checking of the result will not fail due to insignificant
changes to the data, e.g. expired collection cells dropped while merging
two collections. Currently, the two compactions, the one inside
`produce_compacted()` and the one done by the caller uses two separate
calls to `gc_clock::now()` to obtain the query time. This can lead to
off-by-one errors in the two query times and subsequently artificial
differences between the two compacted mutations, ultimately failing the
test due to a false-positive.
To prevent this allow callers to pass in a query time, the same they
used to compact the input mutation (`m`).
This solves another source of flakyness in unit tests using the mutation
source test suite.
Refs: #4695Fixes: #4747
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190726144032.3411-1-bdenes@scylladb.com>
If we had an error while reading, then we would have failed to close
the reader, which in turn can cause memory corruption. Make the
closing more robust by using then_wrapped (that doesn't skip on
exception) and log the error for analysis.
Fixes#4761.
* seastar c1be3c912f...3f88e9068b (3):
> reactor: improve handling of connect storms
> json: Make date formatter use RFC8601/RFC3339 format
> reactor: fix deadlock of stall detector vs dlopen
Fixes#4759.
Currently, install.sh just used for building .rpm, we have similar build script
under dist/debian, sometimes it become inconsistent with install.sh.
Since most of package build process are same, we should share install.sh on both
.rpm and .deb package build process.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190725123207.2326-1-syuu@scylladb.com>
Across all calls to `make_fragments_with_non_monotonic_positions()`, to
prevent off-by one errors between the separately generated test input
and expected output. This problem was already supposed to be fixed by
5f22771ea8 but for some reason that only
used the same deletion time *inside* a single call, which will still
fall short in some cases.
This should hopefully fix this problem for good.
Refs: #4695
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190724073240.125975-1-bdenes@scylladb.com>
If the user specifies an output file name using "--xunit=<filename>",
test.py will write the test results of non-boost tests to the file in the XUnit XML format.
Every boost test creates its own results file already.
Resolves#4680.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
Explicity pass the port number of the AWS metadata server API
when creating a corresponding socket.
This patch fixes the regression introduced by 4ef940169f.
Fixes#4719
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
compaction: allow collecting purged data
Allow the compaction initiator to pass an additional consumer that will
consume any data that is purged during the compaction process. This
allows the separate retention of these dead cells and tombstone until
some long-running process like compaction safely finishes. If the
process fails or is interrupted the purged data can be used to prevent
data resurrection.
This patch was developed to serve as the basis for a solution to #4531
but it is not a complete solution in and on itself.
This series is a continuation of the patch: "[PATCH v1 1/3] Introduce
Garbage Collected Consumer to Mutation Compactor" by Raphael S.
Carvalho <raphaelsc@scylladb.com>.
Refs: #4531
* https://github.com/denesb/scylla.git compaction_collect_purged_data/v8:
Introduce compaction_garbage_collector interface
collection_type_impl::mutation: compact_and_expire() add collector
parameter
row: add garbage_collector
row_marker: de-inline compact_and_expire()
row_marker: add garbage_collector
Introduce Garbage Collected Consumer to Mutation Compactor
tests: mutation_writer_test.cc/generate_mutations() ->
random_schema.hh/generate_random_mutations()
tests/random_schema: generate_random_mutations(): remove `engine`
parameter
tests/random_schema: add assert to make_clustering_key()
tests/random_schema: generate_random_mutations(): allow customizing
generated data
tests: random_schema: futurize generate_random_mutations()
random_schema: generate_random_mutations(): restore indentation
data_model: extend ttl and expiry support
tests/random_schema: generate_random_mutations(): generate partition
tombstone
random_schema: add ttl and expiry support
tests/random: add get_bool() overload with random engine param
random_schema: generate_random_mutations(): ensure partitions are
unique
tests: add unit tests for the data stream split in compaction
"
After switching to rpc stream interface, we increased the row buffer
size. Code works on the buffer that do not yield can stall the reactor.
This series fixes the issue by futurizing or running the code in thread
and yield.
Fixes: #4642
"
* 'repair_switch_to_rpc_stream_fix_stall' of https://github.com/asias/scylla:
repair: Enable rpc stream in row level repair
repair: Wrap with foreign_ptr to avoid cross cpu free
repair: Futurize get_repair_rows_size and row_buf_size
repair: Avoid calling get_repair_rows_size in get_sync_boundary
repair: Futurize row_buf_csum
repair: Yield inside get_set_diff
repair: Use get_repair_rows_size helper in get_sync_boundary
repair: Avoid stall in do_estimate_partitions_on_local_shard
remove get_row_diff
repair: Futurize get_row_diff to avoid stall
repair: Fix possible stall in request_row_hashes
repair: Allow default construct for repair_row
repair: Remove apply_rows
repair: Run get_row_diff_with_rpc_stream in a thread
repair: Run get_row_diff_and_update_peer_row_hash_sets inside a thread
repair: Run get_row_diff inside a thread
repair: Add apply_rows_on_master_in_thread
repair: Add apply_rows_on_follower
repair: Futurize working_row_hashes
repair: Remove get_full_row_hashes helper
Data listener reads are implemented as flat_mutation_readers, which
take a reference to the listener and then execute asynchronously.
The listener can be removed between the time when the reference is
taken and actual execution, resulting in a dangling pointer
dereference.
Fix by using a weak_ptr to avoid writing to a destroyed object. Note that writes
don't need protection because they execute atomically.
Fixes#4661.
Tests: unit (dev)
"
Issue #4558 describes a situation in which failure to execute clearsnapshots
will hard crash the node. The problem is that clearsnapshots will internally use
lister::rmdir, which in turn has two in-tree users: clearing snapshots and clearing
temporary directories during sstable creation. The way it is currently
coded, it wraps the io functions in io_check, which means that failures
to remove the directory will crash the database.
We recently saw how benign failures crashed a database during
clearsnapshot: we had snapshot creation running in parallel, adding more
files to the directory that wasn't empty by the time of deletion. I
have also seen very often users add files to existing directories by
accident, which is another possibility to trigger that.
This patch removes the io_check from lister, and moves it to the caller
in which we want to be more strict. We still want to be strict about
the creation of temporary directories, since users shouldn't be touching
that in any way.
Also while working on that, I realized we have no tests for snapshots of
any kind in tree, so let's write some
"
* 'snapshots' of https://github.com/glommer/scylla:
tests: add tests for snapshots.
lister: don't crash the node on failure to remove snapshot
We were using segment::_closed to decide whether _file was already
closed. Unfortunately they are not exactly the same thing. As far as
I understand it, segments can be closed and reused without actually
closing the file.
Found with a seastar patch that asserts on destroying an open
append_challenged_posix_file_impl.
Fixes#4745.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190721171332.7995-1-espindola@scylladb.com>
Compilation fails with fmt release 5.3.0 when we print a bytes_view
using "{}" formatter.
Compiler's complain is: "error: static assertion failed: mismatch
between char-types of context and argument"
Fix this by explicitly using to_hex() converter.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190716221231.22605-3-vladz@scylladb.com>
Merged patch series by Piotr Sarna:
This series introduces the concept of "computed" column, which represents
values not provided directly by the user, but computed on the fly -
possibly using other column values. It will be used in the future to
implement map value indexing, collection indexing, etc. Right now the only
use is the token column for secondary indexes - which is a column computed
from the base partition key value.
After this series, another one that depends on it and adds map value
indexing will be pushed.
Tests: unit(dev)
Piotr Sarna (14):
schema: add computed info to column definition
schema: add implementation of computing token column
schema: allow marking columns as computed in schema builder
service: add computed columns feature
view: check for computed columns in view
view: remove unused token_for function
database: add fixing previous secondary index schemas
tests: disable computed columns feature in schema change test
tests: add schema change test regeneration comment
db: add system_schema.computed_columns
docs: init system_schema_keyspace.md with column computations
tests: generate new test case for schema change + computed cols
index: mark token column as 'computed' when creating mv
tests: add checking computed columns in SI
column_computation.hh | 63 ++++++++
db/schema_features.hh | 4 +-
db/schema_tables.hh | 4 +
idl/frozen_schema.idl.hh | 1 +
schema.hh | 40 +++++
schema_builder.hh | 4 +-
schema_mutations.hh | 18 ++-
service/storage_service.hh | 8 +
view_info.hh | 2 -
database.cc | 6 +-
db/schema_tables.cc | 146 ++++++++++++++++--
db/view/view.cc | 46 +++---
index/secondary_index_manager.cc | 2 +-
schema.cc | 58 ++++++-
schema_mutations.cc | 14 +-
service/storage_service.cc | 5 +
tests/schema_change_test.cc | 63 ++++++--
tests/secondary_index_test.cc | 28 ++++
docs/system_schema_keyspace.md | 40 +++++
plus about 200 new test sstable files
The original "test_schema_digest_does_not_change" test case ensures
that schema digests will match for older nodes that do not support
all the features yet (including computed columns).
The additional case uses sstables generated after computed columns
are allowed, in order to make sure that the digest computed
including computed columns does not change spuriously as well.
Schema change test might need regenerating every time a system table
is added. In order to save future developer's time on debugging this
test, a short description of that requirement is added.
In order to make sure that old schema digest is not recomputed
and can be verified - computed columns feature is initially disabled
in schema_change_test.
The reason for that is as follows: running CQL test env assumes that
we are running the newest cluster with all features enabled. However,
the mere existence of some features might influence digest calculation.
So, in order for the existing test to work correctly, it should have
exactly the same set of cluster supported features as it had during
its creation. It used to be "all features", but now it's "all features
except computed columns". One can think of that as running a cluster
with some nodes not yet knowing what computed columns are, so they
are not taken into account when computing digests.
Additionally, a separate test case that takes computed column digest
into account will be generated and added in this series.
If a schema was created before computed columns were implemented,
its token column may not have been marked as computed.
To remedy this, if no computed column is found, the schema
will be recreated.
The code will work correctly even without this patch in order to support
upgrading from legacy versions, but it's still important: it transforms
token columns from the legacy format to new computed format, which will
eventually (after a few release cycles) allow dropping the support for
legacy format altogether.
Computed columns feature should be checked before creating
index schemas the new way - by adding computed column names
to system_schema.computed_columns.
Some columns may represent not user-provided values, but ones computed
from other columns. Currently an example is token column used in secondary
indexes to provide proper ordering. In order to avoid hardcoding special
cases in execution stage, optional additional information for computed
columns is stored in column definition.
streaming_reader_lifecycle_policy::create_reader() was ignoring the
partition_slice passed to it and always creating the reader for the
full slice.
That's wrong because create_reader() is called when recreating a
reader after it's evicted. If the reader stopped in the middle of
partition we need to start from that point. Otherwise, fragments in
the mutation stream will appear duplicated or out of ordre, violating
assumptions of the consumers.
This was observed to result in repair writing incorrect sstables with
duplicated clustering rows, which results in
malformed_sstable_exception on read from those sstables.
Fixes#4659.
In v2:
- Added an overload without partition_slice to avoid changing existing users which never slice
Tests:
- unit (dev)
- manual (3 node ccm + repair)
Backport: 3.1
Reviewd-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1563451506-8871-1-git-send-email-tgrabiec@scylladb.com>
Given a list of ranges to stream, stream_transfer_task will create an
reader with the ranges and create a rpc stream connection on all the shards.
When user provides ranges to repair with -st -et options, e.g.,
using scylla-manger, such ranges can belong to only one shard, repair
will pass such ranges to streaming.
As a result, only one shard will have data to send while the rpc stream
connections are created on all the shards, which can cause the kernel
run out of ports in some systems.
To mitigate the problem, do not open the connection if the ranges do not
belong to the shard at all.
Refs: #4708
"
Fix another source of flakyness in mutation_reader_test. This one is caused by storage_service_for_tests lacking a config::broadcast_to_all_shards() call, triggering an invalid memory access (or SEGFAULT) when run on more than one shards.
Refs: #4695
"
* 'fix_storage_service_for_tests' of https://github.com/denesb/scylla:
tests: storage_service_for_tests: broadcast config to all shards
tests: move storage_service_for_tests impl to test_services.cc
The function announce_column_family_drop() drops (deletes) a base table
and all the materialized-views used for its secondary indexes, but not
other materialized views - if there are any, the operation refuses to
continue. This is exactly what CQL's "DROP TABLE" needs, because it is
not allowed to drop a table before manually dropping its views.
But there is no inherent reason why it we can't support an operation
to delete a table and *all* its views - not just those related to indexes.
This patch adds such an option to announce_column_family_drop().
This option is not used by the existing CQL layer, but can be used
by other code automating operations programatically without CQL.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716150559.11806-1-nyh@scylladb.com>
Currently scylla-server.service uses DefaultTimeoutStopSec = 90, if Scylla
does not able to clean-shutdown in 90sec we may have data corruption on the node.
Since we already set TimeoutStartSec = 900, we can use TimeoutSec to set both
TimeoutStartSec and TimeoutStopSec to 900.
See #4700
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190717095416.10652-1-syuu@scylladb.com>
All statement objects which derive from cf_statement, including
drop_index_statement, have a column_family() returning the name of the
column family involved in this statement. For most statement this is
known at the time of construction, because it is part of the statement,
but for "DROP INDEX", the user doesn't specify the table's name - just
the index name. So we need to override column_family() to find the
table name.
The existing implementation assert()ed that we can always find such
a table, but this is not true - for example, in a DROP INDEX with
"IF EXISTS", it is perfectly fine for no such table to exist. In this
case we don't want a crash, and not even an except - it's fine that
we just return an empty table name.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716180104.15985-1-nyh@scylladb.com>
While inspecting the snapshot code, I realized that we don't have any
tests for it. So I decided to add some.
Unfortunately I couldn't come up with a test of clearsnapshot reliably
failing to remove the directory: relying on create snapshot +
clearsnapshot is racy (not always happen), and other tricks that can be
used to reproduce this -- like creating a root-owned file inside the
snapshots directory -- is environment-dependent, and a bit ugly for unit
tests. Dtests would probably be a better place for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
lister::rmdir has two in-tree users: clearing snapshots and clearing
temporary directories during sstable creation. The way it is currently
coded, it wraps the io functions in io_check, which means that failures
to remove the directory will crash the database.
We recently saw how benign failures crashed a database during
clearsnapshot: we had snapshot creation running in parallel, adding more
files to the directory that wasn't empty by the time of deletion. I
have also seen very often users add files to existing directories by
accident, which is another possibility to trigger that.
This patch removes the io_check from lister, and moves it to the caller
in which we want to be more strict. We still want to be strict about
the creation of temporary directories, since users shouldn't be touching
that in any way.
Fixes#4558
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
disable_sstable_write needs to acquire _sstable_deletion_sem to properly synchronize
with background deletions done by on_compaction_completion to ensure no sstables will
be created or deleted during reshuffle_sstables after
storage_service::load_new_sstables disables sstable writes.
Fixes#4622
Test: unit(dev), nodetool_additional_test.py migration_test.py
"
* 'scylla-4622-fix-disable-sstable-write' of https://github.com/bhalevy/scylla:
table: document _sstables_lock/_sstable_deletion_sem locking order
table: disable_sstable_write: acquire _sstable_deletion_sem
table: uninline enable_sstable_write
table: reshuffle_sstables: add log message
If a node is a seed node, it can not be started with
replace-address-first-boot or the replace-address flag.
The issue is that as a seed node it will generate new tokens instead of
replacing the existing one the user expect it to replaec when supplying
the flags.
This patch will throw a bad_configuration_error exception
in this case.
Fixes#3889
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fixes#4717
Bug in ipv6 support series caused inet_address serialization
to include an additional "size" parameter in the address chunk.
Message-Id: <20190716134254.20708-1-calle@scylladb.com>
Due to recent changes to the config subsystem, configuration has to be
broadcast to all shards if one wishes to use it on them. The
`storage_service_for_tests` has a `sharded<gms::gossiper>` member, which
reads config values on initialization on each shard, causing a crash as
the configuration was initialized only on shard 0. Add a call to
`config::broadcast_to_all_shards()` to ensure all shards have access to
valid config values.
The cleanup procedure will move any sstable out of its sstable run
because sstables are cleaned up individually and they end up receiving
a new run identifier, meaning a table may potentially end up with a
new sstable run for each of the sstables cleaned.
SStable cleanup needs to be run aware, so that the run structure is
not messed up after the operation is done. Given that only one fragment
or other, composing a sstable run, may need cleanup, it's better to keep
them in their original sstable run.
Fixes#4663.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Make it easier for users, and also avoid duplicating knowledge
about descriptor defaults across the codebase.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds assertion to make sure we fail early in such cases.
Currently the
test_multishard_combining_reader_non_strictly_monotonic_positions is
flaky. The test is somewhat unconventional, in that it doesn't use the
same instance of data as the input to the test and as it's expected
output, instead it invokes the method which generates this data
(`make_fragments_with_non_monotonic_positions()`) twice, first to
generate the input, and a secondly to generate the expected output. This
means that the test is prone to any deviation in the data generated by
said method. One such deviation, discovered recently, is that the method
doesn't explicitly specify the deletion time of the generated range
tombstones. This results in this deletion time sometimes differing
between the test input and the expected output. Solve by explicitly
passing the same deletion time to all created range tombstones.
Refs: #4695
Fixes a segfault when querying for an empty keyspace.
Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combinind_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.
Fixes#4689.
Start n1, n2
Create ks with rf = 2
Run repair on n2
Stop n2 in the middle of repair
n1 will notice n2 is DOWN, gossip handler will remove repair instance
with n2 which calls remove_repair_meta().
Inside remove_repair_meta(), we have:
```
1 return parallel_for_each(*repair_metas, [repair_metas] (auto& rm) {
2 return rm->stop();
3 }).then([repair_metas, from] {
4 rlogger.debug("Removed all repair_meta for single node {}", from);
5 });
```
Since 3.1, we start 16 repair instances in parallel which will create 16
readers.The reader semaphore is 10.
At line 2, it calls
```
6 future<> stop() {
7 auto gate_future = _gate.close();
8 auto writer_future = _repair_writer.wait_for_writer_done();
9 return when_all_succeed(std::move(gate_future), std::move(writer_future));
10 }
```
The gate protects the reader to read data from disk:
```
11 with_gate(_gate, [] {
12 read_rows_from_disk
13 return _repair_reader.read_mutation_fragment() --> calls reader() to read data
14 })
```
So line 7 won't return until all the 16 readers return from the call of
reader().
The problem is, the reader won't release the reader semaphore until the
reader is destroyed!
So, even if 10 out of the 16 readers have finished reading, they won't
release the semaphore. As a result, the stop() hangs forever.
To fix in short term, we can delete the reader, aka, drop the the
repair_meta object once it is stopped.
Refs: #4693
Duplicate partitions can appear as a result of the same partition key
generated more than once. For now we simply remove any duplicates. This
means that in some circumstances there will be less partitions generated
than asked.
When generating data, the user can now also generate ttls and
expiry for all generated atoms. This happens in a controlled way, via a
generator functor, very similar to how the timestamps are generated.
This functor is also used by `random_schema` to generate `deletion_time`
for all tombstones, so the user now has full control of when all of the
atoms can be GC'd.
Use an internally create instance of random engine. Passing a readily
seeded engine from the outside is pointless now that we have a mechanism
to seed entire test suites with a command line algorithm: the internal
engine is seeded from tests::random, so the seed of the test suite
determines the internal seed as well.
Update the sole user of this method (mutation_writer_test.cc) to not
generate local seeds anymore.
Introduce consumer in mutation compactor that will only consume
data that is purged away from regular consumer. The goal is to
allow compaction implementation to do whatever it wants with
the garbage collected data, like saving it for preventing
data resurrection from ever happening, like described in
issue #4531.
noop_compacted_fragments_consumer is made available for users that
don't need this capability.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
the row_marker when it expired and would be discarded.
The collector param is optional and defaults to nullptr.
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
all atoms that are expired and can would be discarded. The body of
`compact_and_expire()` was changed so that it checks cells' tombstone
coverage before it checks their expiry, so that cells that are both
covered by a tombstone and also expired are not passed to the collector.
The collector is forwarded to
`collection_type_impl::mutation::compact_and_expire()` as well.
The collector param is optional and defaults to nullptr
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
all atoms that are expired and would be discarded. The body of
`compact_and_expire()` was changed so that it checks cells' tombstone
coverage before it checks their expiry, so that cells that are both
covered by a tombstone and also expired are not passed to the collector.
The collector param is optional and defaults to nullptr. To accommodate
the collector, which needs to know the column id, a new `column_id`
parameter was added as well.
Fixes#4713
Modifying config files to use sharded storage misses the fact
that extensions are allowed to add non-member config fields to
the main configuration, typically from "extra" config_file
objects.
Unless those "extra" files are broadcast when main file broadcast,
the values will not be readable from other shards.
This patch propagates the broadcast to all other config files
whose entries are in the top level object. This ensures we
always keep data up to date on config reload.
Message-Id: <20190715135851.19948-1-calle@scylladb.com>
configure.py currently takes some time to write build.ninja. If the user
interrupts (e.g., control-C) configure.py, it can leave behind a partial
or even empty build.ninja file. This is most frustrating when the user
didn't explicitly run "configure.py", but rather just ran "ninja" and
ninja decided to run configure.py, and after interrupting it the user
cannot run "ninja" again because build.ninja is gone. Another result of
losing build.ninja is that the user now needs to remember which parameters
to run "configure.py", because the old ones stored in build.ninja were lost.
The solution in this patch is simple: We write the new build.ninja contents
into a temporary file, not directly into build.ninja. Then, only when the
entire file has been succesfully written, do we rename the temporary file
to its intended name - build.ninja.
Fixes#4706
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190715122129.16033-1-nyh@scylladb.com>
This interface can be used to implement a garbage collector that
collects atoms that are purged due to expiry during compaction.
The intended usage is collecting purged atoms for safekeeping until the
compaction process finishes safely, to be dropped only at the end when
the compaction is known to have finished successfully.
When scylla is started for the first time with PasswordAuthenticator
enabled, it can be that a record of the default superuser
will be created in the table with the can_login and is_superuser
set to null. It happens because the module in charge of creating
the row is the role manger and the module in charge of setting the
default password salted hash value is the password authenticator.
Those two modules are started together, it the case when the
password authenticator finish the initialization first, in the
period until the role manager completes it initialization, the row
contains those null columns and any loging attempt in this period
will cause a memory access violation since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and autorizer only after the role manger
finished its initialization.
Tests:
1. Unit tests (release)
2. Auth and cqlsh auth related dtests.
Fixes#4226
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
In scylla-debuginfo package, we have /usr/lib/debug/opt/scylladb/libreloc/libthread_db-1.0.so-666.development-0.20190711.73a1978fb.el7.x86_64.debug
but we actually does not have libthread_db.so.1 in /opt/scylladb/libreloc
since it's not available on ldd result with scylla binary.
To debug thread, we need to add the library in a relocatable package manually.
Fixes#4673
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190711111058.7454-1-syuu@scylladb.com>
The resource manager is used to manage common resources between
various hints managers. In-flight hints used to be one of the shared
resources, but it proves to cause starvation, when one manager eats
the whole limit - which may be especially painful if the background
materialized views hints manager starves the regular hints manager,
which can in turn start failing user writes because of admission control.
This patch makes the limit per-manager again,
which effectively reverts the limit to its original behavior.
Fixes#4483
Message-Id: <8498768e8bccbfa238e6a021f51ec0fa0bf3f7f9.1559649491.git.sarna@scylladb.com>
We were missing calls to underlying_type in a few locations and so the
insert would think the given literal was invalid and the select would
refuse to fetch a UDT field.
Fixes#4672
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190708200516.59841-1-espindola@scylladb.com>
Queries to system.size_estimates table which are not single parition queries
caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer.
This happened because multishard_combinind_reader assumes that shards return rows belonging
to separate partitions, which was not the case for size_estimates_mutation_reader.
This commit fixes the issue and closes#4689.
Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.
Refactor tests to use seastar::thread.
`disable_sstable_write` needs to acquire `_sstable_deletion_sem`
to properly synchronize with background deletions done by
`on_compaction_completion` to ensure no sstables will be created
or deleted during `reshuffle_sstables` after
`storage_service::load_new_sstables` disables sstable writes.
Fixes#4622
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the row_level_diff_detect_algorithm::send_full_set_rpc_stream as
supported algo. If both repair master and followers support it, the
master will use the rpc stream interface, otherwise use the old rpc verb
interface.
The moved set_diff and rows will be freed on the target cpu instead of the
source cpu, which will cause a lot of cross-cpu frees.
To fix, wrap them in foreign_ptr.
This fixes a possible cause of #4614.
From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.
My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.
That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.
This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.
If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.
With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.
Fixes#4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>
In debug mode the LSA needs objects to be 8-byte aligned in order to
maximise coverage from the AddressSanitizer.
Usually `close_active()` creates a dummy objects that covers the end of
the segment being closed. However, it the last real objects ends in the
last eight bytes of the segment then that dummy won't be created because
of the alignment requirements. This broke exit conditions on loops
trying to read all objects in the segment and caused them to attempt to
dereference address at the end of the segment. This patch fixes that.
Fixes#4653.
"
If the user creates a keyspace with the 'SimpleStrategy' replication class
in a multi-datacenter environment, they will receive a warning in the CQL shell
and in the server logs.
Resolves#4481 and #4651.
"
* 'multidc' of https://github.com/kbr-/scylla:
Warn user about using SimpleStrategy with Multi DC deployment
Add warning support to the CQL binary protocol implementation
"
Fixes#2027
Modifies inet address type in scylla to use seastar::net::inet_address,
and removes explicit use of ipv4_addr in various network code in favour
of socket_address. Thus capable of resolving and binding to ipv6.
Adds config option to enable/disable ipv6 (default enabled), so
upgrading cluster can continue to work while running mixed version
nodes (since gossip message address serialization becomes different).
"
* 'calle/ipv6' of https://github.com/elcallio/scylla:
test-serialization: Add small roundtrip test for inet address (v4 + v6)
inet_address/init: Make ipv6 default enabled
db::config: Add enable ipv6 switch (default off)
gms::inet_address: Make serialization ipv6 aware
Remove usage of inet_address::raw_addr()
Replace use of "ipv4_addr" with socket_address
inet_address: Add optional family to lookup
gms::inet_address: Change inet_address to wrap actual seastar::net::inet_address
types: Add ipv6_address support
Post commit b3adabda2d
(Reduce logalloc differences between debug and release)
logalloc_test's memory footprint has grown, in particular
in test_region_groups, and it triggers the oom killer on
our test automation machines.
This patch scales down this test case so it requires less memory.
Fixes#4669
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This adds a '--repeat N' command line option to test.py, which can be
used to execute the tests N times. This is useful for finding flakey
tests, for example.
Message-Id: <20190710092115.15960-1-penberg@scylladb.com>
Currently the handling of partition tombstones is broken in multiple
ways:
* The partition-tombstone is lost when the bucket is calculated for its
timestamp (due to a misplaced `std::exchange()`).
* When the `partition_start` fragment (containing the partition
tombstone) is actually written to the bucket we emit another
`partition_start` fragment before it because the bucket has not seen
that partition before and we fail to notice that we are actually writing
the partition header.
This bug was allowed to fly under the radar because the unit test was
accidentally not creating partition tombstones in the generated data
(due to a mistake). It was discovered while working on unit tests for
another test and fixing the data generation function to actually
generate partition tombstones.
This patch fixes both problems in the handling of partition tombstones
but it doesn't yet fixes the test. That is deferred until the patch
series which uncovered this bug is merged to avoid merge conflicts.
The other series mentioned here is: [PATCH v6 00/15] compaction: allow
collecting purged data
Fixes: #4683
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190710092427.122623-1-bdenes@scylladb.com>
The get_value method returns a pointer to the value that is used by the
value_to_json method.
The assumption is that the void pointer points to the actual value.
Fixes#4678
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Since commit bb56653 (repair: Sync schema from follower nodes before
repair), the behaviour of handling down node during repair has been
changed. That is, if a repair follower is down, it will fail to sync
schema with it and the repair of the range will be skipped. This means
a range can not be repaired unless all the nodes for the replicas are up.
To fix, we filter out the nodes that is down and mark the repair is
partial and repair with the nodes that are still up.
Tests: repair_additional_test:RepairAdditionalTest.repair_with_down_nodes_2b_test
Fixes: #4616
Backports: 3.1
Message-Id: <621572af40335cf5ad222c149345281e669f7116.1562568434.git.asias@scylladb.com>
A read which arrived to a non-replica and had to be forwarded to a
replica by the coordinator is accounted in an own metric,
reads_coordinator_outside_replica_set.
Most often such read is produced by a driver which is unaware of
token distribution on the ring.
If a read was forwarded to another replica due to heat weighted
load balancing or query preference set by the user, it's not accounted
in the metric.
In case of a multi-partition read (a query using IN statement,
e.g. x in (1, 2, 3)), if any of the keys is read from a
non-local node the read is accounted as a non-local.
The rationale behind it is that if the user tries to be careful and send
IN queries only to the same vnode, they are rewarded with the counter
staying at zero, while if they send multi-partition IN queries without
any precautions, they will see the metric go up which gives them a
starting point for investigating performance problems.
Closes#4338
Add a metric to account writes which arrived to a non-replica and
had to be forwarded by a coordinator to a replica.
The name of the added metric is 'writes_coordinator_outside_replica_set'.
Do not account forwarded read repair writes, since they are already
accounted by a reads_coordinator_outside_replica_set metric, added in a
subsequent patch.
In scope of #4338.
Because inet_address was initially hardcoded to
ipv4, its wire format is not very forward compatible.
Since we potentially need to communicate with older version nodes, we
manually define the new serial format for inet_address to be:
ipv4: 4 bytes address
ipv6: 4 bytes marker 0xffffffff (invalid address)
16 bytes data -> address
The std::find_if and std::copy can stall if the number of rows are big.
Introduce a helper move_row_buf_to_working_row_buf to move the rows
that yields to avoid stall.
Simliar to commit 9785754e0d
lister::guarantee_type needs to check the entry's type,
not the symlink it may point to.
Fixes#4606
The nodetool_refresh_with_wrong_upload_modes_test dtest creates a broken
symlink and following it fails, as it should, with the default follow_symlink::yes
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190626110734.4558-1-bhalevy@scylladb.com>
"
This miniseries expands big_decimal interface with convenience operators
(-=, +, -), provides test cases for it and makes one of the constructors
explicit.
Tests: unit(dev)
"
* 'expand_big_decimal_interface' of https://github.com/psarna/scylla:
utils: make string-based big decimal constructor explicit
tests: add more operators to big decimal tests
utils: add operators to big_decimal
"
Implement LIKE parsing, intermediate representation, and query processing. Add tests
for this implementation (leaving the LIKE functionality tests in
tests/like_matcher_test.cc).
Refs #4477.
"
* 'finish-like' of https://github.com/dekimir/scylla:
cql3: Add LIKE operator to CQL grammar
cql3: Ensure LIKE filtering for partition columns
cql3: Add LIKE restriction
cql3: Add LIKE relation
If the user creates a keyspace with the 'SimpleStrategy' replication class
in a multi-datacenter environment, they will receive a warning in the CQL shell
and in the server logs.
Resolves#4481.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
Partition columns are implicitly filtered whenever possible, avoiding
expensive post-processing. But there are exceptions, eg, when
partition key is only partially restricted, or for CONTAINS
expressions. Here we add LIKE to this list of exceptions.
Also fix compute_bounds() to punt on LIKE restrictions, which cannot
be translated into meaningful bounds.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This restriction leverages like_matcher to perform filtering.
Make single_column_relation::new_LIKE_restriction() return this new
restriction.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add a new type of relation with operator LIKE. Handle it in
relation::to_restriction by introducing a new virtual method for it.
The temporary implementation of this method returns null; that will be
replaced in a subsequent patch.
Add abstract_type::is_string() to recognize string columns and
disallow LIKE operator on non-string columns.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Command line arguments are parsed twice in Scylla: once in main and once in Seastar's app_template::run.
The first parse is there to check if the "--version" flag is present --- in this case the version is printed
and the program exists. The second parsing is correct; however, most of the arguments were improperly treated
as positional arguments during the first parsing (e.g., "--network host" would treat "host" as a positional argument).
This happened because the arguments weren't known to the command line parser.
This commit fixes the issue by moving the parsing code until after the arguments are registered.
Resolves#4141.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
"
The put_row_diff, get_row_dif and get_full_row_hashes verbs are switched
to use rpc stream instead of rpc verb. They are the verbs that could
send big rpc messages. The rpc stream sink and source are created per
repair follower for each of the above 3 verbs. The sink and source are
shared for multiple requests during the entire repair operation for a
given range, so there is no overhead to setup rpc stream.
The row buffer is now increased to 32MiB from 256KiB, giving better
bandwidth in high latency links. The downside of bigger row buffer is
reduced possibility that all the rows inside a row buffer are identical.
This causes more full hashes to be exchanged. To address this issue, the
plan is to add better set reconciliation algorithm in addition to the
current send full hashes.
I compared rebuild using regular stream plan with repair using rpc
stream. With 2 nodes, 1 smp, 8M rows, delete all data on one of the
node before repair or rebuild.
repair using seastar rpc verb
Time to complete: 82.17s
rebuild using regular streaming which uses seastar rpc stream
Time to complete: 63.87s
repair using seastar rpc stream
Time to complete: 68.48s
For 1) and 3), the improvement is 16.6% (repair using rpc verb v.s. repair using rpc stream)
For 2) and 3), the difference is 7.2% (repair v.s. stream)
The result is promising for the future repair-based bootstrap/replace node operations.
NOTE: We do not actually enable rpc stream in row level repair for now. We
will enable it after we fix the the stall issues caused by handling
bigger row buffers.
Fixes#4581
"
* 'repair_switch_to_rpc_stream_v9' of https://github.com/asias/scylla: (45 commits)
docs: Add RPC stream doc for row level repair
repair: Mark some of the helper functions static
repair: Increase max row buf size
repair: Hook rpc stream version of verbs in row level repair
repair: Add use_rpc_stream to repair_meta
repair: Add is_rpc_stream_supported
repair: Add needs_all_rows flag to put_row_diff
repair: Optimize get_row_diff
repair: Register repair_get_full_row_hashes_with_rpc_strea
repair: Register repair_put_row_diff_with_rpc_stream
repair: Register repair_get_row_diff_with_rpc_stream
repair: Add repair_get_full_row_hashes_with_rpc_stream_handler
repair: Add repair_put_row_diff_with_rpc_stream_handler
repair: Add repair_get_row_diff_with_rpc_stream_handler
repair: Add repair_get_full_row_hashes_with_rpc_stream_process_op
repair: Add repair_put_row_diff_with_rpc_stream_process_op
repair: Add repair_get_row_diff_with_rpc_stream_process_op
repair: Add put_row_diff_with_rpc_stream
repair: Add put_row_diff_sink_op
repair: Add put_row_diff_source_op
...
If the cluster supports row level repair with rpc stream interface, we
can use bigger row buf size to have better repair bandwidth in high
latency links.
So we can avoid copy _working_row_buf in get_row_diff on master node if
there is only one follower node and all repair rows are needed by
follower node.
Move _working_row_buf instead of copy if it is follower node or
it is master node with only one follow. In these cases, the
_working_row_buf will not be used after this function, so we can move
it.
Fixes#4640
Iterating extensions in commitlog.cc should mimic that in sstables.cc,
i.e. a simple future-chain. Should also use same order for read and
write open, as we should preserve transformation stack order.
Message-Id: <20190702150028.18042-1-calle@scylladb.com>
On recent changes install.sh mistakenly copies dist/common/scripts to
/opt/scylladb/scripts/scripts, it should be /opt/scylladb/scripts.
Same on /opt/scylladb/scyllatop as well.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190702120030.13729-1-syuu@scylladb.com>
Use sink_source_for_repair to define sink_source_for_put_row_diff
with sink = repair_row_on_wire_with_cmd, source = repair_stream_cmd
for REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM rpc stream verb.
Use sink_source_for_repair to define sink_source_for_get_row_diff with
sink = repair_hash_with_cmd, source = repair_row_on_wire_with_cmd for
REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM rpc stream verb.
Use the sink_source_for_repair to define
sink_source_for_get_full_row_hashes with sink = repair_stream_cmd,
source = repair_hash_with_cmd for
REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM rpc stream verb.
- REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM
Get repair rows from follower nodes
- REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
Put repair rows to follower nodes
- REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM:
Get full hashes from follower nodes
If someone opens a connection to port 9042 and sends some random bytes,
there is a 1 in 64 probability we'll recognize it as a valid frame
(since we only check the version byte, allowing versions 1-4) and we'll
try to read frame.length bytes for the body. If this value is very large,
we'll run out of memory very quickly.
Fix this by checking for reasonable body size (100kB). The initial message
must be a STARTUP, whose body is a [string map] of options, of which just
three are recognized. 100kB is plenty for future expansion.
Note that this does not replace true security on listening ports and
only serves to protect against mistakes, not attacks. An attacker can
easily exhaust server memory by opening many connections and trickle-feeding
them small amounts of data so they appear alive.
We can't use the config item native_transport_max_frame_size_in_mb,
because that can be legitimately large (and the default is atrocious,
256MB).
Fixes#4366.
This patchset allows changing the configuration at runtime, The user
triggers this by editing the configuration file normally, then
signalling the database with SIGHUP (as is traditional).
The implementation is somewhat complicated due the need to store
non-atomic mutable state per-shard and to synchronize the values in
all shards. This is somewhat similar to Seastar's sharded<>, but that
cannot be used since the configuration is read before Seastar is
initialized (due to the need to read command-line options).
Tests: unit (dev, debug), manual test with extra prints (dev)
Ref #2689Fixes#2517.
* seastar b629d5ef7a...a5b9f77d52 (6):
> perftune.py: add comment explaining why we don't log errors when binding NVMe IRQs for all but i3.nonmetal machines
> sharded: do a two phase shutdown for sharded services
> chunked_fifo: add iterator
> perftune.py: fix the i3 metal detection pattern
> core/memory: remove translation api
> reactor: file_type: offer option to not follow symbolic links
Change the type from bool to updateable_value<bool> throughout the dependency
chain and mark it as live updateable.
In theory we should also observe the value and trigger compaction if it changes,
but I don't think it is worthwhile.
Once we start using updateable_value<>, we must make it refer
to the updateable_value_source<> on the same shard, and to do
that we need to call broadcast_to_all_shards() first (this
creates the per-shard copy).
With dynamically updateable configuration, tracking the source of a value
is more important, since we'll accept or reject updates depending on the source.
Fix the source of client_encryption_options, which we RMW, by preserving the original
source.
Replace the per-shard value we store with an updateable_value_source, which
allows updating it dynamically and allows users to track changes.
The broadcast_to_all_shards() function is augmented to apply modifications
when called on a live system.
Since some of our values are not atomic (strings) and the administrative
information needed to track references to values is also not atomic, we will
need to store them per-shard. To do that we add a vector of per-shard data
to config_file, where each element is itself a vector of configuration items.
Since we need to operate generically on items (copying them from shard to shard)
we store them in a type-erased form.
Only mutable state is stored per-shard.
The updateable_value and updateable_value_source classes allow broadcasting
configuration changes across the application. The updateable_value_source class
represents a value that can be updated, and updateable_value tracks its source
and reflects changes. A typical use replaces "uint64_t config_item" with
"updateable_value<uint64_t> config_item", and from now on changes to the source
will be reflected in config_item. For more complicated uses, which must run some
callback when configuration changes, you can also call
config_item.observe(callback) to be actively notified of changes.
Dynamically updateable configuration requires checking whether configuration items
changed or not, so we can skip firing notifiers for the common case where nothing
changed.
This patch adds a comparison operator for seed_provider_type, which was missing it.
Currently, we allow adjusting configuration via
cfg.whatever() = 5;
by returning a mutable reference from cfg.whatever(). Soon, however, this operation
will have side effects (updating all references to the config item, and triggering
notifiers). While this can be done with a proxy, it is too tricky.
Switch to an ordinary setter interface:
cfg.whatever.set(5);
Because boost::program_options no longer gets a reference to the value to be written
to, we have to move the update to a notifier, and the value_ex() function has to
be adjusted to infer whether it was called with a vector type after it is
called, not before.
config_file and db::config are soon not going to be copyable. The reason is that
in order to support live updating, we'll need per-shard copies of each value,
and per-shard tracking of references to values. While these can be copied, it
will be an asycnronous operation and thus cannot be done from a copy constructor.
So to prepare for these changes, replace all copies of db::config by references
and delete config_file's copy constructor.
Some existing references had to be made const in order to adapt the const-ness
of db::config now being propagated (rather than being terminated by a non-const
copy).
Copying the config object breaks the link between the original and the copied
object, so updates to config items will not be visible. To allow updates, don't
copy any more, and instead keep a pointer.
The pointer won't work will once config is updateable, since the same object is
shared across multiple shard, but that can be addressed later.
Currently, database::_cfg is a copy of the global configuration. But this means
that we have multiple master copies of the configuration, which makes updating
the configuration harder. In order to eliminate the copy we have to eliminate the
database default constructor, which creates a config object, so that all
remaining constructors can receive config by reference and retain that reference.
We want to eliminate the default database constructor (to be explained in
the next patch), so eliminate its only use in gossip_test, using the regular
constructor instead.
With this unused descriptors and objects should always be poisoned.
* https://github.com/espindola/scylla/ align-descriptors-so-that-they-are-poisoned-v4:
Convert macros to inline functions
More precise poisoning in logalloc
On current scylla.spec, shell glob pattern "scylla_*setup" does not correctly
expanded, it mistakenly created a symlink named "/usr/sbin/scylla_*setup".
We need to expand them, need to create symlinks for each setup scripts.
Fixes#4605
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-2-syuu@scylladb.com>
Since we don't install .pyc files on our package, python3 will generate .pyc
file when we launch setup script first time.
Then we will have unmanaged files under script directory, it will remain when
Scylla package upgraded / removed.
We need to compile *.py when we generate relocatable package, add compiled .pyc
files on .rpm/.deb packages.
Fixes#4612
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-1-syuu@scylladb.com>
This change aligns descriptors and values to 8 bytes so that poisoning
a descriptor or value doesn't interfere with other descriptors and
values.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
When writing streamed data into sstables, while using time window
compaction strategy, we have to emit a new sstable for each time window.
Otherwise we can end up with sstables, mixing data from wildly different
windows, ruining the compaction strategy's ability to drop entire
sstables when all data within is expired. This gets worse as these mixed
sstables get compacted together with sstables that used to contain a
single time window.
This series provides a solution to this by segregating the data by its
atom's the time-windows. This is done on the new RPC streaming and the
new row-level, repair, memtable-flush and compaction, ensuring that the
segregation requirement is respected at all times.
Fixes: #2687
"
* 'segregate-data-into-sstables-by-time-window-streaming/v2.1' of ssh://github.com/denesb/scylla:
streaming,repair: restore indentation
repair: pass the data stream through the compaction strategy's interposer consumer
streaming: pass the data stream through the compaction strategy's interposer consumer
TWCS: implement add_interposer_consumer()
compaction_strategy: add add_interposer_consumer()
Add mutation_source_metadata
tests: add unit test for timestamp_based_splitting_writer
Add timestamp_based_splitting_writer
Introduce mutation_writer namespace
Most of our tests use overly simplistic schemas (`simple_schema`) or
very specialized ones that focus on exercising a specific area of the
tested code. This is fine in most places as not all code is schema
dependent, however practice has showed that there can be nasty bugs
hiding in dark corners that only appear with a schema that has a
specific combination of types.
This series introduces `tests::random_schema` a utility class for
generating random schemas and random data for them. An important goal is
to make using random schemas in tests as simple and convenient as
possible, therefore fostering the appearance of tests using random
schemas.
Random schema was developed to help testing code I'm currently working
on, which segregates data by time-windows. As I wasn't confident in my
ability to think of every possible combination of types that can break
my code I came up with random-schema to help me finding these corner
cases. So far I consider it a success, it already found bugs in my code
that I'm not sure I would have found if I had relied on specific
schemas. It also found bugs in unrelated areas of the code which proves
my point in the first paragraph.
* https://github.com/denesb/scylla.git random_schema/v5:
tests/data_model: approximate to the modeled data structures
data_value: add ascii constructor
tests/random-utils.hh: add stepped_int_distribution
tests/random-utils.hh: get_int() add overloads that accept external
rand engine
tests/random-utils.hh: add get_real()
tests: introduce random_schema
Exploit the interposer customization point to inject a consumer that will
segregate the mutation stream based on the contained atoms' timestamps,
allowing the requirements of TWCS to be mantained every time sstables
are written to disk.
For the implementation, `timestamp_based_splitting_writer` is used,
with a classifier that maps timestamps to windows.
This is a band-aid patch that is supposed to fix the immediate problem
of large collections causing large allocations. The proper fix is to
use IMR but that will take time. In the meanwhile alleviate the
pressure on the memory allocator by using a chunked storage collection
(utils::chunked_vector) instead of std::vector. In the linked issue
seastar::chunked_fifo was also proposed as the container to use,
however chunked fifo is not traversable in reverse which disqualifies
it from this role.
Refs: #3602
This will be the customization point for compaction strategies, used to
inject a specific interposer consumer that can manipulate the fragment
stream so that it satisfies the requirements of the compaction strategy.
For now the only candidate for injecting such an interposer is
time-window compaction strategy, which needs to write sstables that
only contains atoms belonging to the same time-window. By default no
interposer is injected.
Also add an accompanying customization point
`adjust_partition_estimate()` which returns the estimated per-sstable
partition-estimate that the interposer will produce.
This struct contains metadata regarding to a mutation_source. Currently
it contains the min and max timestamp. This will be used later by
compaction strategies to determine whether a given mutation stream has
to be split or not.
This writer implements the core logic of time-window based data
segregation. It splits the fragment stream provided by a reader, such
that each atom (cell) in the stream will be written into a consumer
based on the time-window its timestamp belongs to. The end result is
that each consumer will only see fragments, whoose atoms all have
timestamps belonging to the same time-window.
When a mutation fragment has atoms belonging to different time-windows,
it is split into as many fragments as needed so each has only atoms
that belong to the same time-window.
Currently there is a single mutation_writer: `multishard_writer`,
however in the next path we are going to add another one. This is the
right moment to move these into a common namespace (and folder), we
have way too much stuff scattered already in the top-level namespace
(and folder).
Also rename `tests/multishard_writer_test.cc` to
`tests/mutation_writer_test.cc`, this test-suite will be the home of all
the different mutation writer's unit test cases.
"
Currently, parser and the consumer save its state and return the
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.
This results in less CPU overhead.
The ka/la reader is left still stopping after row end.
Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):
perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%
Tests: unit (dev)
"
* 'sstable-optimize-partition-scans' of https://github.com/tgrabiec/scylla:
sstable: mc: reader: Do not stop parsing across partitions
sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader
sstables: reader: Simplify _single_partition_read checking
sstables: reader: Update stats from on_next_partition()
sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range()
sstables: ka/la: reader make push_ready_fragments() safe to call many times
sstables: mc: reader: Move out-of-range check out of push_ready_fragments()
sstables: reader: Return void from push_ready_fragments()
sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range()
sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end
O_DSYNC causes commitlog to pre-allocate each commitlog segment by writing
zeroes into it. In normal operation, this is amortized over the many
times the segment will be reused. In tests, this is wasteful, but under
the default workstation configuration with /tmp using tmpfs, no actual
writes occur.
However on a non-default configuration with /tmp mounted on a real disk,
this causes huge disk I/O and eventually a crash (observed in
schema_change_test). The crash is likely only caused indirectly, as the
extra I/O (exacerbated by many tests running in parallel) xcauses timeouts.
I reproduced this problem by running 15 copies of schema_change_test in
parallel with /tmp mounted on a real filesystem. Without this change, I
usually observe one or two of the copies crashing, with the change they
complete (and much more quickly, too).
The tracker object was a static object in repair.cc. At the time we initialize
it, we do not know the smp::count, so we have to initialize the _repairs
object when it is used on the fly.
void init_repair_info() {
if (_repairs.size() != smp::count) {
_repairs.resize(smp::count);
}
}
This introduces a race if init_repair_info is called on different
thread(shard).
To fix, put the tracker object inside the newly introduced
repair_service object which is created in main.cc.
Fixes#4593
Message-Id: <b1adef1c0528354d2f92f8aaddc3c4bee5dc8a0a.1561537841.git.asias@scylladb.com>
This is quick fix to the immediate problem of large collections causing
large allocations, triggering stalls or OOM. The proper fix is to
use IMR for storing the cells, but that is a complex change that will
require time, so let's not stall/OOM in the meanwhile.
This series makes sure new schema is propagated to repair master and
follower nodes before repair.
Fixes#4575
* dev.git asias/repair_pull_schema_v2:
migration_manager: Add sync_schema
repair: Sync schema from follower nodes before repair
"
If the database supports infinite bound range deletions,
CQL layer will no longer throw an error indicating that both ranges
need to be specified.
Fixes#432
Update test_range_deletion_scenarios unit test accordingly.
"
* 'cql3-lift-infinite-bound-check' of https://github.com/bhalevy/scylla:
cql3: lift infinite bound check if it's supported
service: enable infinite bound range deletions with mc
database: add flag for infinite bound range deletions
Piotr Sarna says:
Fixes#4540
This series adds proper handling of aggregation for paged indexed queries.
Before this series returned results were presented to the user in per-page
partial manner, while they should have been returned as a single aggregated
value.
Tests: unit(dev)
Piotr Sarna (8):
cql3: split execute_base_query implementation
cql3: enable explicit copying of query_options
cql3: add a query options constructor with explicit page size
cql3: add proper aggregation to paged indexing
cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
tests: add query_options to cquery_nofail
tests: add indexing + paging + aggregation test case
tests: add indexing+paging test case for clustering keys
* seastar ded50bd8a4...b629d5ef7a (9):
> sharded: no_sharded_instance_exception: fix grammar
> core,net: output_stream: remove redundant std::move()
> perftune: make sure that ethtool -K has a chance of succeeding
> net/dpdk: upgrade to dpdk-19.05
> perftune.py: Fix a few more places where we use deprecated pyudev.Device ones
> reactor: provide an uptime function
> rpc: add sink::flush() to streaming api
> Use a table to document the various build modes
> foreign_ptr: Fix compilation error due to unused variable
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.
Fixes#4589
Since commit "repair: Use the same schema version for repair master and
followers", repair master and followers uses the same schema version
that master decides to use during the whole repair operation. If master
has older version of schema, repair could ignore the data which makes use
of the new schema, e.g., writes to new columns.
To fix, always sync the schema agreement before repair.
The master node pulls schema from followers and applies locally. The
master then uses the "merged" schema. The followers use
get_schema_for_write() to pull the "merged" schema.
Fixes#4575
Backports: 3.1
random_schema is a utility class that provides methods for generating
random schemas as well as generating data (mutations) for them. The aim
is to make using random schemas in tests as simple and convenient as
is using `simple_schema`. For this reason the interface of
`random_schema` follows closely that of `simple_schema` to the extent
that it makes sense. An important difference is that `random_schema`
relies on `data_model` to actually build mutations. So all its
mutation-related operations work with `data_model::mutation_descrition`
instead of actual `mutation` objects. Once the user arrived at the
desired mutation description they can generate an actual mutation via
`data_model::mutation_description::build()`.
In addition to the `random_schema` class, the `random_schema.hh` header
exposes the generic utility classes for generating types and values
that it internally uses.
random_schema is fully deterministic. Using the same seed and the same
set of operations is guaranteed to result in generating the same schema
and data.
Make the the data modelling structures model their "real" counterparts
more closely, allowing the user greater control on the produced data.
The changes:
* Add timestamp to atomic_value (which is now a struct, not just an
alias to bytes).
* Add tombstone to collection.
* Add row_tombstone to row.
* Add bound kinds and tombstone to range_tombstone.
Great care was taken to preserve backward compatibility, to avoid
unnecessary changes in existing code.
If the database supports infinite bound range deletions,
CQL layer will no longer throw an error indicating that both ranges
need to be specified.
[bhalevy] Update test_range_deletion_scenarios unit test accordingly.
Fixes#432
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As soon as it's agreed that the cluster supports sstables in mc format,
infinite bound range deletions in statements can be safely enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Database can only support infinite bound range deletions if sstable mc
format is supported. As a first step to implement these checks,
an appropriate flag is added to database.
Indexed queries used to erroneously return partial per-page results
for aggregation queries. This test case used to reproduce the problem
and now ensures that there would be no regressions.
Refs #4540
Aggregated and paged filtering needs to aggregate the results
from all pages in order to avoid returning partial per-page
results. It's a little bit more complicated than regular aggregation,
because each paging state needs to be translated between the base
table and the underlying view. The routine keeps fetching pages
from the underlying view, which are then used to fetch base rows,
which go straight to the result set builder.
Fixes#4540
For internal use, there already exists a query_options constructor
that copies data from another query_options with overwritten paging
state. This commit adds an option to overwrite page size as well.
In order to handle aggregation queries correctly, the function that
returns base query results is split into two, so it's possible to
access raw query results, before they're converted into end-user
CQL message.
pow2_rank is undefined for 0.
bucket_of currently works around that by using a bitmask of 0.
To allow asserting that count_{leading,trailing}_zeros are not
called with 0, we want to avoid it at all call sites.
Fixes#4153
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190623162137.2401-1-bhalevy@scylladb.com>
"
partitioned_sstable_set is not self sufficient because it relies on
compatible_ring_position_view, which in turn relies on lifetime of
sstable object. This leads to use-after-free. Fix this problem by
introducing compatible_ring_position and using it in p__s__s.
Fixes#4572.
Test: unit (dev), compaction dtests (dev)
"
* 'projects/fix_partitioned_sstable_set/v4' of ssh://github.com/bhalevy/scylla:
tests: Test partitioned sstable set's self-sufficiency
sstables: Fix partitioned_sstable_set by making it self sufficient
Introduce compatible_ring_position and compatible_ring_position_or_view
Partitioned sstable set is not self sufficient, because it uses compatible_ring_position_view
as key for interval map, which is constructed from a decorated key in sstable object.
If sstable object is destroyed, like when compaction releases it early, partitioned set
potentially no longer works because c__r__p__v would store information that is already freed,
meaning its use implies use-after-free.
Therefore, the problem happens when partitioned set tries to access the interval of its
interval map and uses freed information from c__r__p__v.
Fix is about using the newly introduced compatible_ring_position_or_view which can hold a
ring_position, meaning that partitioned set is no longer dependent on lifetime of sstable
object.
Retire compatible_ring_position_view.hh as it is now unused.
Fixes#4572.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The motivation for supporting ring position is that containers using
it can be self sufficient. The existing compatible_ring_position_view
could lead to use after free when the ring position data, it was built
from, is gone.
The motivation for compatible_ring_position_or_view is to allow lookup
on containers that don't support different key types using c__r__p,
and also to avoid unnecessary copies.
If the user is provided only with a ring_position_view, c__r__p__or_v
could be built from it and used for lookups.
Converting ring_position_view to ring_position is very bug prone because
there could be information lost in the process.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently to_string takes raw bytes. This means that to print a
data_value it has to first be serialized to be passed to to_string,
which will then deserializes it.
This patch adds a virtual to_string_impl that takes a data_value and
implements a now non virtual to_sting on top of it.
I don't expect this to have a performance impact. It mostly documents
how to access a data_value without converting it to bytes.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190620183449.64779-3-espindola@scylladb.com>
Recently, in merge commit 2718c90448,
we added the ability to cancel pending view-update requests when we detect
that the target node went down. This is important for view updates because
these have a very long timeout (5 minutes), and we wanted to make this
timeout even longer.
However, the implementation caused a race: Between *creating* the update's
request handler (create_write_response_handler()) and actually starting
the request with this handler (mutate_begin()), there is a preemption point
and we may end up deleting the request handler before starting the request.
So mutate_begin() must gracefully handle the case of a missing request
handler, and not crash with a segmentation fault as it did before this patch.
Eventually the lifetime management of request handlers could be refactored
to avoid this delicate fix (which requires more comments to explain than
code), or even better, it would be more correct to cancel individual writes
when a node goes down, not drop the entire handler (see issue #4523).
However, for now, let's not do such invasive changes and just fix bug that
we set out to fix.
Fixes#4386.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190620123949.22123-1-nyh@scylladb.com>
The repair_rows in row_list are sorted. It is only possible for the
current repair_row to share the same partition key with the last
repair_row inserted into repair_row_on_wire. So, no need to search from
the beginning of the repair_rows_on_wire to avoid quadratic complexity.
To fix, look at the last item in repair_rows_on_wire.
Fixes#4580
Message-Id: <08a8bfe90d1a6cf16b67c210151245879418c042.1561001271.git.asias@scylladb.com>
Tests without custom flags were already being run with -m2G. Tests
with custom flags have to manually specify it, but some were missing
it. This could cause tests to fail with std::bad_alloc when two
concurrent tests tried to allocate all the memory.
This patch adds -m2G to all tests that were missing it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190620002921.101481-1-espindola@scylladb.com>
"
Fixes#4569
This series fixes the infinite paging for indexed queries issue.
Before this fix, paging indexes tended to end up in an infinite loop
of returning pages with 0 results, but has_more_pages flag set to true,
which confused the drivers.
Tests: unit(dev)
Branches: 3.0, 3.1
"
* 'fix_infinite_paging_for_indexed_queries' of https://github.com/psarna/scylla:
tests: add test case for finishing index paging
cql3: fix infinite paging for indexed queries
We still has "{{^jessie}}" tag on scylla-server systemd unit file to
skip using AmbientCapabilities on Debian 8, but it does not able to work
anymore since we moved to single binary .deb package for all debian variants,
we must share same systemd unit file across all Debian variants.
To do so we need to have separated file on /etc/systemd to define
AmbientCapabilities, create the file while running postinst script only
if distribution is not Debian 8, just like we do in .rpm.
See #3344
See #3486
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190619064224.23035-1-syuu@scylladb.com>
Thrift 0.11 changed to generate c++ code with
std::shared_ptr instead of boost::shared_ptr.
- https://issues.apache.org/jira/browse/THRIFT-2221
This was forcing scylla to stick with older versions
of thrift.
Fixes issue #3097.
thrift: add type aliases to build with old and new versions.
update to using namespace =
Since we cannot use dh --with=systemd because we don't want to
automatically enabling systemd units, manage them by our setup scripts,
we have to do 'systemctl daemon-reload' manually.
(On dh --with=systemd, systemd helper automatically provides such
scirpts)
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190618000210.28972-1-syuu@scylladb.com>
Running tests in debug mode takes 25:22.08 in my machine. Using
sanitize instead takes that down to 10:46.39.
The mode is opt in, in that it must be explicitly selected with
"configure.py --mode=sanitize" or "ninja sanitize". It must also be
explicitly passed to test.py.
Unfortunately building with asan, optimizations and debug info is
very slow and there is nothing like -gline-tables-only in gcc.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190617170007.44117-1-espindola@scylladb.com>
Currently, parser and the consumer save its state and return the
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.
This results in less CPU overhead.
The ka/la reader is left still stopping after row end.
Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):
perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%
The old code was making advance_to_next_partition() behave
incorrectly when _single_partition_read, which was compensated by a
check in read_partition().
Cleaner to exit early.
Not a bug fix, just makes the implementation more robust against changes.
Before this patch this might have resulted in partition_end being
pushed many times.
Currently, calling push_ready_fragments() with _mf_filter disengaged
or with _mf_filter->out_of_range() causes it to call
_reader->on_out_of_clustering_range(), which emits the partition_end
fragment. It's incorrect to emit this fragment twice, or zero times,
so correctness depends on the fact that push_ready_fragments() is
called exactly once when transitioning between partitions.
This is proved to be tricky to ensure, especially after partition_end
starts to be emitted in a different path as well. Ensuring that
push_ready_fragments() is *NOT* called after partition_end is emitted
from consume_partition_end() becomes tricky.
After having to fix this problem many times after unrelated changes to
the flow, I decide that it's better to refactor.
This change moves the call of on_out_of_clustering_range() out of
push_ready_fragments(), making the latter safe to call any number of
times.
The _mf_filter->out_of_range() check is moved to sites which update
the filter.
It's also good because it gets rid of conditionals.
Currently, if there is a fragment in _ready and _out_of_range was set
after row end was consumer, push_ready_fragments() would return
without emitting partition_end.
This is problematic once we make consume_row_start() emit
partiton_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.
Indexed queries need to translate between view table paging state
and base table paging state, in order to be able to page the results
correctly. One of the stages of this translation is overwriting
the paging state obtained from the base query, in order to return
view paging state to the user, so it can be used for fetching next
pages. Unfortunately, in the original implementation the paging
state was overwritten only if more pages were available,
while if 'remaining' pages were equal to 0, nothing was done.
This is not enough, because the paging state of the base query
needs to be overwritten unconditionally - otherwise a guard paging state
value of 'remaining == 0' is returned back to the client along with
'has_more_pages = true', which will result in an infinite loop.
This patch correctly overwrites the base paging state unconditionally.
Fixes#4569
This patch set fixes repair nodes using different schema version and
optimizes the hashing thanks to the fact now all nodes uses same schema
version.
Fixes: #4549
* seastar-dev.git asias/repair_use_same_schema.v3:
repair: Use the same schema version for repair master and followers
repair: Hash column kind and id instead of column name and type name
It is guaranteed repair nodes use the same schema. It is faster to hash
column kind and id.
Changing the hashing of mutation fragment causes incompatibility with
mixed clusters. Let's backport to the 3.1 release, which includes row
level repair for the first time and is not released yet.
Refs: #4549
Backports: 3.1
Before this patch, repair master and followers use their own schema
version at the point repair starts independently. The schemas can be
different due to schema change. Repair uses the schema to serialize
mutation_fragment and deserialize the mutation_fragment received from
peer nodes. Using different schema version to serialize and deserialize
cause undefined behaviour.
To fix, we use the schema the repair master decides for all the repair
nodes involved.
On top of this patch, we could do another step to make sure all nodes
has the latest schema. But let's do it in a separate patch.
Fixes: #4549
Backports: 3.1
The intention is just to document what is currently done. If someone
wants to propose changes, that can be done after the current practices
have been documented.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190524135109.29436-1-espindola@scylladb.com>
The code that decides whether a query should used indexing was buggy - a partition key index might have influenced the decision even if the whole partition key was passed in the query (which effectively means that indexing it is not necessary).
Fixes#4539
Closes https://github.com/scylladb/scylla/pull/4544
Merged from branch 'fix_deciding_whether_a_query_uses_indexing' of git://github.com/psarna/scylla
tests: add case for partition key index and filtering
cql3: fix deciding if a query uses indexing
Currently NIC selection prompt on scylla_setup just proceed setup when
user just pressed Enter key on the prompt.
The prompt should ask NIC name again until user input correct NIC name.
Fixes#4517
Message-Id: <20190617124925.11559-1-syuu@scylladb.com>
"
These patches fix remaining issues with gcc9 build, that involve a gcc9 bug, a gcc9 bug, and a stricter warning.
Tests: unit(debug, dev, release).
"
* 'fix-gcc9-build' of https://github.com/pdziepak/scylla:
dht/ring_position: silence complaints about uninitialised _token_bound
xx_hasher: disable -Warray-bounds
api/column_family: work around gcc9 bug in seastar::future<std::any>
Currently, calling unfreeze() using the wrong version of the schema
results in undefined behavior. That can cause hard-to-debug
problems. Better to throw in such cases.
Refs #4549.
Tests:
- unit (dev)
Message-Id: <1560459022-23786-1-git-send-email-tgrabiec@scylladb.com>
In release mode gcc9 has a false positive warning about out of bound
access in xxhash implementation:
./xxHash/xxhash.c:799:27: error: array subscript -3 is outside array bounds of ‘long unsigned int [1]’ [-Werror=array-bounds]
This is solved by disabling -Warray-bounds in the xxhash code.
There is a gcc9 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90415
that makes it impossible to pass std::any through a seastar::future<T>.
Fortunately, there is only one user of seastar::future<std::any> in
Scylla and it is not performance-critical. This patch avoids the gcc9
bug by using seastar::future<std::unique_ptr<std::any>>.
We saw a node crashing today with nodetool clearsnapshot being called.
After investigation, the reason is that nodetool clearsnapshot ws called
at the same time a new snapshot was created with the same tag. nodetool
clearsnapshot can't delete all files in the directory, because new files
had by then been created in that directory, and crashes on I/O error.
There are, many problems with allowing those operations to proceed in
parallel. Even if we fix the code not to crash and return an error on
directory non-empty, the moment they do any amount of work in parallel
the result of the operation becomes undefined. Some files in the
snapshot may have been deleted by clear, for example, and a user may
then not be able to properly restore from the backup if this snapshot
was used to generate a backup.
Moreover, although we could lock at the granularity of a keyspace or
column family, I think we should use a big hammer here and lock the
entire snapshot creation/deletion to avoid surprises (for example, if a
user requests creation of a snapshot for all keyspaces, and another
process requests clear of a single keyspace)
Fixes#4554
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190614174438.9002-1-glauber@scylladb.com>
In practice, we always want to use the same sanitizer flags with
seastar and scylla. Seastar was already marking its sanitizer flags
public, so what was missing was exporting the link flags via pkgconfig
and dropping the duplicates from scylla.
I am doing this after wasting some time editing the wrong file.
This depends on the seastar patch to export the sanitizer flags in
pkgconfig.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We used to use /opt/scylladb just for Scylla build toolchain and
dependency libraries, not for Scylla main package.
But since we merged relocatable package, Scylla main binary and
dependency libraries are all located under /opt/scylladb, only
setup scripts remained on /usr/lib/scylla.
It strange to keep using both /usr/lib/<app name> and /opt/<app name>,
we should merge them into single place.
Message-Id: <20190614011038.17827-1-syuu@scylladb.com>
Before this patch mc sstables writer was ignoring
empty cellpaths. This is a wrong behaviour because
it is possible to have empty key in a map. In such case,
our writer creats a wrong sstable that we can't read back.
This is becaus a complex cell expects cellpath for each
simple cell it has. When writer ignores empty cellpath
it writes nothing and instead it should write a length
of zero to the file so that we know there's an empty cellpath.
Fixes#4533
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>
Consider
master: row(pk=1, ck=1, col=10)
follower1: row(pk=1, ck=1, col=20)
follower2: row(pk=1, ck=1, col=30)
When repair runs, master fetches row(pk=1, ck=1, col=20) and row(pk=1,
ck=1, col=30) from follower1 and follower2.
Then repair master sends row(pk=1, ck=1, col=10) and row(pk=1, ck=1,
col=30) to follower1, follower1 will write the row with the same
pk=1, ck=1 twice, which violates uniqueness constraints.
To fix, we apply the row with same pk and ck into the previous row.
We only needs this on repair follower because the rows can come from
multiple nodes. While on repair master, we have a sstable writer per
follower, so the rows feed into sstable writer can come from only a
single node.
Tests: repair_additional_test.py:RepairAdditionalTest.repair_same_row_diff_value_3nodes_test
Fixes: #4510
Message-Id: <cb4fbba1e10fb0018116ffe5649c0870cda34575.1560405722.git.asias@scylladb.com>
On repair follower node, only decorated_key_with_hash and the
mutation_fragment inside repair_row are used in apply_rows() to apply
the rows to disk. Allow repair_row to initialize partially and throw if
the uninitialized member is accessed to be safe.
Message-Id: <b4e5cc050c11b1bafcf997076a3e32f20d059045.1560405722.git.asias@scylladb.com>
To provide test reproducibility use the seastar local_random_engine.
To reproduce a run, use the --random-seed command line option
with the seed printed accordingly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190613114307.31038-1-bhalevy@scylladb.com>
When a column is not present in the select clause, but used for
filtering, it usually needs to be fetched from replicas.
Sometimes it can be avoided, e.g. if primary key columns form a valid
prefix - then, they will be optimized out before filtering itself.
However, clustering key prefix can only be qualified for this
optimization if the whole partition key is restricted - otherwise
the clustering columns still need to be present for filtering.
This commit also fixes tests in cql_query_test suite, because they now
expect more values - columns fetched for filtering will be present as
well (only internally, the clients receive only data they asked for).
Fixes#4541
Message-Id: <f08ebae5562d570ece2bb7ee6c84e647345dfe48.1560410018.git.sarna@scylladb.com>
Relocation of python scripts mentions scylla-server in paths explicitly.
It should use {{product}} instead. The current build is failing when
{{product}} is different than scylla-server
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190613012518.28784-1-glauber@scylladb.com>
On branch-3.1 / master, we are getting following error:
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/data: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/commitlog: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/view_hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
It seems like owner verification of data directory fails because
scylla-server process is running in root but data directory owned by
scylla, so we should run services as scylla user.
Fixes#4536
Message-Id: <20190611113142.23599-1-syuu@scylladb.com>
To avoid 'Bad permmisons' error when user changed default umask, we need
to verify system umask is acceptable for scylla-server.
Fixes#4157
Message-Id: <20190612130343.6043-1-syuu@scylladb.com>
* seastar 253d6cb...ded50bd (14):
> Only export sanitizer flags if used
> perftune.py: use pyudev.Devices methods instead of deprecated pyudev.Device ones
> Add a Sanitize build mode
> Merge "perftune.py : new tuning modes" from Vlad
> reactor: clarify how submit_to() destroys the function object
> Export the sanitizer flags via pkgconfig
> smp: Delete unprocessed work items
> iotune: fixed finding mountpoint infinite loop
> net: Fix dereferencing moved object
> Always enable the exception scalability hack
> Merge "Simple cleanups in future.hh" from Rafael
> tests: introduce testing::local_random_engine
> core/deleter: Fix abort when append() is called twice with a shared deleter
> rpc stream: do not crash if a stream is used after eos
Currently, REPAIR_GET_COMBINED_ROW_HASH RPC verb returns only the
repair_hash object. In the future, we will use set reconciliation
algorithm to decode the full row hashes in working row buf. It is useful
to return the number of rows inside working row buf in addition to the
combined row hashes to make sure the decode is successful.
It is also better to use a wrapper class for the verb response so we can
extend the return values later more easily with IDL.
Fixes#4526
Message-Id: <93be47920b523f07179ee17e418760015a142990.1559771344.git.asias@scylladb.com>
With this patch, when using asan, we poison segment memory that has
been allocated from the system but should not be accessible to user
code.
Should help with debugging user after free bugs.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190607140313.5988-1-espindola@scylladb.com>
The code that decides whether a query should used indexing
was buggy - a partition key index might have influenced the decision
even if the whole partition key was passed in the query (which
effectively means that indexing it is not necessary).
Fixes#4539
Introduced in 513d01d53e
The script is trying to determine the branch to shallow clone
when an rpm is missing and has to be built.
This functionality in the current implementation assumes it is being run inside
a git repository, but that must not be the case if the script is triggered after
local rpms were placed on the local directory.
This happens when putting all necessary rpm files in: dist/ami/files
And then running: dist/ami/build_ami.sh --localrpm
The dist/ami/ and dist/ami/files are the only ones required for this action so
querying the git repository in that situation makes no sense.
Fixes#4535
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190611112455.13862-1-bhalevy@scylladb.com>
Build progress virtual reader uses Scylla-specific
scylla_views_builds_in_progress table in order to represent legacy
views_builds_in_progress rows. The Scylla-specific table contains
additional cpu_id clustering key part, which is trimmed before
returning it to the user. That may cause duplicated clustering row
fragments to be emitted by the reader, which may cause undefined
behaviour in consumers. The solution is to keep track of previous
clustering keys for each partition and drop fragments that would cause
duplication. That way if any shard is still building a view, its
progress will be returned, and if many shards are still building, the
returned value will indicate the progress of a single arbitrary shard.
Fixes#4524
Tests:
unit(dev) + custom monotonicity checks from tgrabiec@scylladb.com
Build progress virtual reader uses Scylla-specific
scylla_views_builds_in_progress table in order to represent
legacy views_builds_in_progress rows. The Scylla-specific table contains
additional cpu_id clustering key part, which is trimmed before returning
it to the user. That may cause duplicated clustering row fragments to be
emitted by the reader, which may cause undefined behaviour in consumers.
The solution is to keep track of previous clustering keys for each
partition and drop fragments that would cause duplication. That way if
any shard is still building a view, its progress will be returned,
and if many shards are still building, the returned value will indicate
the progress of a single arbitrary shard.
Fixes#4524
Tests:
unit(dev) + custom monotonicity checks from <tgrabiec@scylladb.com>
All Scylla code is written with "using namespace seastar", i.e., no
"seastar::" prefix for Seastar symbols. Document this in the coding style.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190610203948.18075-1-nyh@scylladb.com>
Fixes#4525
req_param uses boost::lexical cast to convert text->var.
However, lexical_cast does not handle textual booleans,
thus param=true causes not only wrong values, but
exceptions.
Message-Id: <20190610140511.15478-1-calle@scylladb.com>
Currently, each shard protects itself by not reading from rpc and the native
transport if in-flight requests consume too much memory for that shard. However,
if all shards then forward their requests to some other shard, then that shard
can easily run out of memory since its load can be multiplied by the number of
shards that send it requests.
To protect against this, use the new Seastar smp_service_group infrastructure.
We create three groups: read, write, and write ack (the latter is needed to
avoid ABBA deadlocks is shard A exhausts all its resources sending writes to shard B,
and shard B simulateously does the same; neither will be able to send
acknowledgements, so if the writes are throttled, they will never be unthrottled
until a timeout occurs).
Range scans are not addressed by this patch since they are handled by
multishard_mutation_query, which has its own complex cross-shard communication
scheme, but it be a similar solution.
Ref #1105 (missing range scan protection)
Tests: unit (dev)
Message-Id: <20190512142243.17795-1-avi@scylladb.com>
If a port value passed as a string this makes the cluster.connect() to
fail with Python3.4.
Let's fix this by explicitly declaring a 'port' argument as 'int'.
Fixes#4527
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190606133321.28225-1-vladz@scylladb.com>
We want to use the same branch on the other repos build-ami needs
as the one we're building for. Automatically find the current branch
using the `git branch` command.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190606133648.15877-1-bhalevy@scylladb.com>
This patch adds a warning option to the user for situations where
rows count may get bigger than initially designed. Through the
warning, users can be aware of possible data modeling problems.
The threshold is initially set to '100,000'.
Tests: unit (dev)
Message-Id: <20190528075612.GA24671@shenzou.localdomain>
A lot of code in scylla is only reachable if SEASTAR_DEFAULT_ALLOCATOR
is not defined. In particular, refill_emergency_reserve in the default
allocator case is empty, but in the seastar allocator case it compacts
segments.
I am trying to debug a crash that seems to involve memory corruption
around the lsa allocator, and being able to use a debug build for that
would be awesome.
This patch reduces the differences between the two cases by having a
common segment_pool that defers only a few operations to different
segment_store implementations.
Tests: unit (debug, dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190606020937.118205-1-espindola@scylladb.com>
Prometheus needs to remember which "instance" (node) each measurement
came from. But it doesn't actually need Scylla to tell it the instance
name - it knows which node it got each measurement from.
After Seastar commit 79281ef287
which fixed Seastar issue https://github.com/scylladb/seastar/issues/477,
the "instance" label on measurements no longer comes from Scylla but rather
is added by Prometheus. This patch corrects the documentation to explain the
current situation, instead of incorrectly saying that Scylla adds the
"instance" label itself.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190602074629.14336-1-nyh@scylladb.com>
The relocatable Python is built from Fedora packages. Unfortunately TLS
certificates are in a different location on Debian variants, which
causes "node_exporter_install" to fail as follows:
Traceback (most recent call last):
File "/usr/lib/scylla/libexec/node_exporter_install", line 58, in <module>
data = curl('https://github.com/prometheus/node_exporter/releases/download/v{version}/node_exporter-{version}.linux-amd64.tar.gz'.format(version=VERSION), byte=True)
File "/usr/lib/scylla/scylla_util.py", line 40, in curl
with urllib.request.urlopen(req) as res:
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1360, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1319, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
Unable to retrieve version information
node exporter setup failed.
Fix the problem by overriding the SSL_CERT_FILE environment variable to
point to the correct location of the TLS bundle.
Message-Id: <20190604175434.24534-1-penberg@scylladb.com>
The to_json_string utility implementation was based on const references
instead of views, which can be a source of unnecessary memory copying.
This patch migrates all to_json_string to use bytes_view and leaves
the const reference version as a thin wrapper.
Message-Id: <2bf9f1951b862f8e8a2211cb4e83852e7ac70c67.1559654014.git.sarna@scylladb.com>
"
Technically queue_reader already exists, however so far it was a
private utility in `multishard_writer.cc`. This mini-series makes it
public and generally useful. The interface is made safer and simpler and
the implementation is improved so it doesn't have two separate buffers.
Also, unit tests are added.
Tests: mutation_reader_test:debug/test_queue_reader, multishard_writer_test:debug
"
* 'queue_reader/v2' of https://github.com/denesb/scylla:
queue_reader: use the reader's buffer as the queue
Make queue_reader public
The queue reader currently uses two buffers, a `_queue` that the
producer pushes fragments into and its internal `_buffer` where these
fragments eventually end up being served to the consumer from.
This double buffering is not necessary. Change the reader to allow the
producer to push fragments directly into the internal `_buffer`. This
complicates the code a little bit, as the producer logic of
`seastar::queue` has to be folded into the queue reader. On the other
hand this introduces proper memory consumption management, as well as
reduces the amount of consumed memory and eliminates the possibility of
outside code mangling with the queue. Another big advantage of the
change is that there is now an explicit way to communicate the EOS
condition, no need to push a disengaged `mutation_fragment_opt`.
The producer of the queue reader now pushes the fragments into the
reader via an opaque `queue_reader_handle` object, which has the
producer methods of `seastar::queue`.
Existing users of queue readers are refactored to use the new interface.
Since the code is more complex now, unit tests are added as well.
We have a script in tree that fixes the schema for distributed system
tables, like tracing, should they change their schema. We use it all the
time but unfortunately it is not distributed with the scylla package,
which makes it using it harder (we want to do this in the server, but
consistent updates will take a while).
One of the problems with the script today that makes distributing it
harder is that it uses the python3 cassandra driver, that we don't want
to have as a server dependency. But now with the relocatable packages in
place there is no reaso not to just add it.
[avi: adjust tools/toolchain/image to point to a new image with
python3-cassandra-driver]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190603162447.24215-1-glauber@scylladb.com>
Unlike CentOS, Debian variants has python3 package on official repository,
so we don't have to use relocatable python3 on these distributions.
However, official python3 version is different on each distribution, we may
have issue because of that.
Also, our scripts and packaging implementation are becoming presuppose
existence of relocatable python3, it is causing issue on Debian
variants.
Switching to relocatable python3 on Debian variants avoid these issues,
it will easier to manage Scylla python3 environments accross multiple
distributions.
Fixes#4495
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190531112707.20082-1-syuu@scylladb.com>
Building on Ubuntu 18 or 19 following the current build instructions
doesn't work. Add information about a few pitfalls. Switch README.md
to recommending dbuild and move the details to HACKING.md.
Message-Id: <20190520152738.GA15198@atlas>
Unlike CentOS, Debian variants has python3 package on official repository,
so we don't have to use relocatable python3 on these distributions.
However, official python3 version is different on each distribution, we may
have issue because of that.
Also, our scripts and packaging implementation are becoming presuppose
existence of relocatable python3, it is causing issue on Debian
variants.
Switching to relocatable python3 on Debian variants avoid these issues,
it will easier to manage Scylla python3 environments accross multiple
distributions.
Fixes#4495
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190526105138.677-1-syuu@scylladb.com>
Apparently we are having some issues running iotune in the i3en instances,
as the values not always make sense. We believe it is something that XFS
is doing, and running fio directly on the device (no filesystem) provides
more meaningful results (more according to AWS published expected values).
For now, let's use fio instead. In this patch I have ran fio for our 4
dimensions in each of the three types of disks (large, xlarge, 3xlarge).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190524111454.27956-1-glauber@scylladb.com>
"
Before this patchset empty counters were incorrectly persisted for
MC format. No value was written to disk for them. The correct way
is to still write a header that informs the counter is empty.
We also need to make sure that reading wrongly persisted empty
counters works because customers may have sstables with wrongly
persisted empty counters.
Fixes#4363
"
* 'haaawk/4363/v3' of github.com:scylladb/seastar-dev:
sstables: add test for empty counters
docs: add CorrectEmptyCounters to sstable-scylla-format
sstables: Add a feature for empty counters in Scylla.db.
sstables: Write header for empty counters
sstables: Remove unused variables in make_counter_cell
sstables: Handle empty counter value in read path
"
One local utility function in cql_query_test.cc duplicates an existing
exception_predicate member. Another can be generalized for wider use
in the future. This patch accomplishes both, retiring a to-do item.
Tests: unit (dev)
"
* 'use-utils-predicate-in-cql_test' of https://github.com/dekimir/scylla:
tests/cql: Replace equery() with cquery_nofail()
tests: Add cquery_nofail() utility
tests: Drop redundant function
Most tests await the result of cql_test_env::execute_cql(). Most
would also benefit from reporting errors with top-level location
included.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
* seastar 3f7a5e1...5cb1234 (5):
> build: Help Seastar to find Boost on Fedora 30
> Merge 'Reinstate nowait aio support' from Avi
> Fix documentation link in README.md
> sharded: add variants to invoke_on() that accept an smp_service_group
> improve error message on AIO setup failure
Due to a bug in an sstable writer, empty counters
were stored without a header.
Correct way of storing empty counter is to still write
a header that indicates the emptiness.
Next patch in this series fixes the write path
but we have to make sure that we handle incorrectly
serialized counters in the read path becuase there
may exist sstables with counters stored without header.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Users may want to know which version of packages are used for the AMI,
it's good to have it on AMI tags and description.
To do this, we need to download .rpm from specified .repo, extract
version information from .rpm.
Fixes#4499
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190520123924.14060-2-syuu@scylladb.com>
Refs #3929
Optionally enables O_DSYNC mode for segment files, and when
enabled ignores actual flushing and just barriers any ongoing
writes.
Iff using O_DSYNC mode, we will not only truncate the file
to max size, but also do an actual initial write of zero:s
to it, since XFS (intended target) has observably less good
behaviour on non-physical file blocks. Once written (and maybe
recycled) we should have rather satisfying throughput on writes.
Note that the O_DSYNC behaviour is hidden behind a default
disabled option. While user should probably seldom worry about
this, we should add some sort of logic i main/init that unless
specified by user, evaluates the commitlog disk and sets this
to true if it is using XFS and looks ok. This is because using
O_DSYNC on things like EXT4 etc has quite horrible performance.
All above statements about performance and O_DSYNC behaviour
are based on a sampling of benchmark results (modified fsqual)
on a statistically non-ssignificant selection of disks. However,
at least there the observed behaviour is a rather large
difference between ::fallocate:ed disk area vs. actually written
using O_DSYNC on XFS, and O_DSYNC on EXT4.
Note also that measurements on O_DSYNC vs. no O_DSYNC does not
take into account the wall-clock time of doing manual disk flush.
This is intentionally ignored, since in the commitlog case, at
least using periodic mode, flushes are relatively rare.
Message-Id: <20190520120331.10229-1-calle@scylladb.com>
"
Add a fallback-mode that can be used when the `scylla ptr` cannot be
used, either because the application is not built with the seastar
allocator, or due to bugs. The fallback mode relies on a more primitive
method for determining how much memory to scan looking for task pointers
inside the task object. This mode, being more primitive, is less prone
to errors, but is more wasteful and less precise.
"
* 'scylla-fiber-fallback-mode/v2' of https://github.com/denesb/scylla:
scylla-gdb.py: scylla_fiber: add fallback mode
scylla-gdb.py: scylla_ptr: add is_seastar_allocator_used()
scylla-gdb.py: pointer_metadata: allow constructing from non-seastar pointers
scylla-gdb.py: scylla_fiber: fix misaligned text in docstring
We compile Seastar unconditionally so that changes to Seastar files are
reflected in Scylla when it's built.
We don't need to unconditionally build `iotune` in the same way.
`iotune` is still listed as a build artifact, so it will be built if
`ninja` is invoked without a particular target.
However, building a specific target (like `ninja build/dev/scylla`) will
not build `iotune`.
Fixes#4165
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <9fb96a281580a8743e04d5dd11398be53960cb58.1558100815.git.jhaberku@scylladb.com>
"
This patchset fixes reactor stalls caused by cache invalidation not being preemptible.
This becomes a problem when there is a lot of partitions in cache inside the invalidated range.
This affects high-level operations like nodetool refresh, table
truncation, repair and streaming.
Fixes#2683
The improvement on stalls was measured using tests/perf_row_cache_update:
Before:
Small partitions, no overwrites:
invalidation: 339.420624 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 339.422144 [ms]}
Small partition with a few rows:
invalidation: 191.855331 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 191.856816 [ms]}
Large partition, lots of small rows:
invalidation: 0.959328 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 0.961453 [ms]}
After:
Small partitions, no overwrites:
invalidation: 400.505554 [ms], preemption: {count: 843, 99%: 0.545791 [ms], max: 0.502340 [ms]}
Small partition with a few rows:
invalidation: 306.352600 [ms], preemption: {count: 644, 99%: 0.545791 [ms], max: 0.506464 [ms]}
Large partition, lots of small rows:
invalidation: 0.963660 [ms], preemption: {count: 2, 99%: 0.009887 [ms], max: 0.963264 [ms]}
The maximum scheduling latency went down form 339 ms to 0.5 ms (task quota).
Tests:
- unit (dev)
"
* tag 'cache-preemptible-invalidation-v2' of github.com:tgrabiec/scylla:
row_cache: Make invalidate() preemptible
row_cache: Switch _prev_snapshot_pos to be a ring_position_ext
dht: Introduce ring_position_ext
dht: ring_position_view: Take key by const pointer
tests: perf_row_cache_update: Rename 'stall' to 'preemption' to avoid confusion
tests: perf_row_cache_update: Report stalls around invalidation
To make product name templatization works correctly, we cannot use
"debian/scylla-server" as package contents directory path,
need to use template like "debian/{{product}}-server" instead.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190517121946.18248-1-syuu@scylladb.com>
The calculation is done in a non preemptable loop over all tables, so if
numbers of tables is very large it may take a while since we also build
a string for gossiper state. Make the loop preemtable and also make
the string calculation more efficient by preallocating memory for it.
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>
invoke_on_all() copies provided function for each shard it is executed
on, so by moving stats map into the capture we copy it for each shard
too. Avoid it by putting it into the top level object which is already
captured by reference.
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>
All `table` instances currently unconditionally allocate a cell locker
for counter cells, though not all need one. Since the lockers occupy
quite a bit of memory (as reported in #4441), it's wasteful to
allocate them when unneeded.
Fixes#4441.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190515190910.87931-1-dejan@scylladb.com>
"
This series contains loosely related generic cleanup patches that the
timestamp based data segregation series depends on. Most of the patches
have to do with making headers self-sustainable, that is compilable on
their own. This was needed to be able to ensure that the new headers
introduced or touched by that series are self-sustainable too.
This series also introduces `schema_fwd.hh` which contains a forward
declaration of `schema` and `schema_ptr` classes. No effort was made to
find and replace all existing ad-hoc schema forward declarations in the
source tree.
"
* 'pre-timestamp-based-data-segregation-cleanup/v1' of https://github.com/denesb/scylla:
encoding_stats.hh: add missing include
sstables/time_window_compaction_strategy.hh: make self-sufficient
sstables/size_tiered_compaction_strategy.hh: make self-sufficient
sstables/compaction_strategy_impl.hh: make header self-sufficient
compaction_strategy.hh: use schema_fwd.hh
db/extensions.hh: use schema_fwd.hh
Add schema_fwd.hh
Makes opening all sstable components go through same file open
routine, optionally applying extensions to each (except TOC which
is special).
Also ensures we read Scylla metadata before other non-TOC
components, as we might need this for extensions (hint hint).
Message-Id: <20190513201821.14417-1-calle@scylladb.com>
The current implementation of the `scylla fiber` command relies on the
`scylla ptr` command to provide metadata on pointers, more
specifically the boundaries of the region the object they point to
occupies. However, in debug mode, seastar is using the standard allocator
and thus the `scylla ptr` command doesn't work.
To work around this, provide a fallback mode for debug builds. This mode
assumes pointers point to the start of objetcts and scans a
configurable region of memory. While less exact than the variant relying
on `scylla ptr` it still works reasonably well.
The size of the to-be-scanned memory region can be set using the
`--scanned-region-size` command line argument. This defaults to 512.
Additionally, add a flag (`--force-fallback-mode`) to force using the
fallback mode. This is useful if `scylla ptr` is not working for any
reason.
Determines whether the application is using the seastar allocator or
not. This is done by attempting to resolve the
`seastar::memory::cpu_mem` symbol. To avoid the expensive symbol lookup
the result is cached. This means that loading a new inferior will
possibly return the wrong value. The cache can be flushed by re-sourcing
the `scylla-gdb.py` script.
"
Although CQL allows SELECT statements with both simple and aggregate
selectors, Scylla disallows them. This patch removes that restriction
and ensures that mixed simple/aggregate selection works as specified
both with and without GROUP BY.
Tests: unit (dev)
"
* 'aggregate-and-simple-select-together' of https://github.com/dekimir/scylla:
cql: Fix mixed selection with GROUP BY
cql: Allow mixing of aggregate and simple selectors
GROUP BY is currently supported by simple_selection, the class used
when all selectors are simple. But when selectors are mixed, we use
selection_with_processing, which does not yet support GROUP BY. This
patch fixes that.
It also adapts one testcase in filtering_test to the new behavior of
simple_selector. The test currently expects the last value seen, but
simple_selector now outputs the first value seen.
(More details: the WHERE clause implicitly selects the columns it
references, and unit tests are forced to provide expected values for
these columns. The user-visible result is unchanged in the test;
users never see the WHERE column values due to filtering in
cql::transport, outside unit tests.)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Scylla currently rejects SELECT statements with both simple and
aggregate selectors, but Cassandra allows them. This patch brings
parity to Scylla.
Fixes#4447.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
Refs #4485.
"
* tag 'fix-gc_clock-digest-v2.1' of github.com:tgrabiec/scylla:
tests: Add test which verifies that schema digest stays the same
tests: Add sstables for the schema digest test
schema_tables, storage_service: Make schema digest insensitive to expired tombstones in empty partition
db/schema_tables: Move feed_hash_for_schema_digest() to .cc file
hashing: Introduce type-erased interface for the hasher
hashing: Introduce C++ concept for the hasher
hashers: Rename hasher to cryptopp_hasher
gc_clock: Fix hashing to be backwards-compatible
The _make_config_values macro reduces duplication (both the item name
and the types need to be available as C++ identifiers and as runtime
strings), but is hard to work with. The macro is huge and editors
don't handle it well, errors aren't identified at the correct
location, and since the macro doesn't have types, it's hard to
refactor.
This series replaces the macro with ordinary C++ code. Some repetition is
introduced, but IMO the result is easier to maintain than the macro. As a
bonus the bulk of the code is moved away from the header file.
Tests: unit (dev), manual testing of the config REST API
* https://github.com/avikivity/scylla config-no-macro/v2
config: make the named_value type name available without requiring
_make_config_values
config: remove value_status from named_value template parameter list
config: add named_value::value_as_json()
api: config: stop using _make_config_values
config: auto-add named_values into config_file
config: add allowed_values parameter to named_value constructor
config: convert _make_config_values to individual named_value member
declarations and initializers
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
That's not a problem during normal cluster operation because the
tombstones will expire at all nodes at the same time, and schema
digest, although changes, will change to the same value on all nodes
at about the same time.
This fix changes digest calculation to not feed any digest for
partitions which are empty after compaction.
The digest returned by schema_mutations::digest() is left unchanged by
this patch. It affects the table schema version calculation. It's not
changed because the version is calculated on boot, where we don't yet
know all the cluster features. It's possible to fix this but it's more
complicated, so this patch defers that.
Refs #4485.
Asd
The motivation is to allow hiding the definition of functions
accepting a hasher. For one, this reduces (re)complication times,
because we can put the definition in .cc
"
scylla_setup is currently broken for OEL. This happens because the
OS detection code checks for RHEL and Fedora. CentOS returns itself
as RHEL, but OEL does not.
"
* 'unbreakable' of github.com:glommer/scylla:
scylla_setup: be nicer about unrecognized OS
scylla_util: recognize OEL as part of the RHEL family
Right now if the user tries to execute this in an unrecognized OS, the
following will be thrown:
Traceback (most recent call last):
File "/usr/lib/scylla/libexec/scylla_setup", line 214, in <module>
do_verify_package('scylla-enterprise-jmx')
File "/usr/lib/scylla/libexec/scylla_setup", line 73, in do_verify_package
if res != 0:
UnboundLocalError: local variable 'res' referenced before assignment
It would be a lot nicer to exit gracefully and print a messge saying what
is going on. This was caught when running on OEL, which the previous patch
fixed. Still, there are other unknown OS out there the users may try to run
on.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Oracle Linux is a RHEL-like distribution and we support it just fine, but our
new incarnation of scylla_setup is failing to recognize it.
os-release for OEL is a bit different. It doesn't have an ID_LIKE string, and
only shows an ID string, which is set to 'ol'. So let's recognize this.
Fixes: #4493
Branches: 3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This change inserts preemption points between removal of partitions.
The main complication is in maintaining consitency in the face of
concurrent population or eviction. We use the same mechanism which is
used by memtable updates. _prev_snapshot_pos is the ring position
which partitions the ring into the part which is already updated in
cache and the one which is yet to be updated. That position should be
set accordingly on preemption.
In case of invalidation, updating means removing all entries in the
range and marking the range as discontinuous. When resuming
invalidation of a range we continue from _prev_snapshot_pos as the
lower bound.
This affects high-level operations like nodetool refresh, table
truncation, repair and streaming.
Fixes#2683
The improvement on stalls was measured using tests/perf_row_cache_update:
Before
Small partitions, no overwrites:
invalidation: 339.420624 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 339.422144 [ms]}
Small partition with a few rows:
invalidation: 191.855331 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 191.856816 [ms]}
Large partition, lots of small rows:
invalidation: 0.959328 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 0.961453 [ms]}
After:
Small partitions, no overwrites:
invalidation: 400.505554 [ms], preemption: {count: 843, 99%: 0.545791 [ms], max: 0.502340 [ms]}
Small partition with a few rows:
invalidation: 306.352600 [ms], preemption: {count: 644, 99%: 0.545791 [ms], max: 0.506464 [ms]}
Large partition, lots of small rows:
invalidation: 0.963660 [ms], preemption: {count: 2, 99%: 0.009887 [ms], max: 0.963264 [ms]}
The maximum scheduling latency went down form 339 ms to 0.5 ms (task quota).
dht::ring_position cannot represent all ring_position_view instances,
in particular those obtained from
dht::ring_position_view::for_range_start(). To allow using the latter,
switch to views.
It's an owning version of ring_position_view.
Note that ring_position has a narrower domain than the
ring_position_view for historical reasons, so we cannot use that.
* seastar f73690e...3f7a5e1 (7):
> Revert "Make sure all allocations/deallocations are properly byte aligned"
> http: fix request content for POST requests
> doc: discourage generic lambdas and unconstrained templates
> smp: add smp_service_group for smp::submit_to() resource control
> Revert "smp: add smp_service_group for smp::submit_to() resource control"
> smp: add smp_service_group for smp::submit_to() resource control
> Make sure all allocations/deallocations are properly byte aligned
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
(cherry picked from commit 549d0eb2f3)
"
This series contains fixes for GCC9 build, mostly corrections needed
after changes in libstdc++. With this series and a workaround for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90415 (not included)
Scylla builds and passes unit tests with GCC9 (tested on Fedora 30,
development mode only).
Tests: unit(dev with gcc8 and gcc9).
"
* tag 'gcc9-fixes/v1' of https://github.com/pdziepak/scylla:
tests/imr: add missing noexcept
counters: bytes_view::pointer is not const pointer
imr/fundamental: use bytes_view::const_pointer for const pointer
In libstdc++ for gcc9 std::basic_string_view::pointer isn't const any
more. As a result the compiler is complaining about reinterpret_cast
casting away const. The solution is to use std::conditional<> to choose
between const pointer for counter view and non-const pointer for mutable
counter view.
In libstdc++ shipped with gcc9 std::basic_string_view::pointer is no
longer constant, which is causing the compiler to complain about
dropping const in reinterpret_cast. The solution is to use
std::basic_string_view::const_pointer.
"
Fix the broken logic that is meant to prevent sending hints when node is
in a DOWN NORMAL state.
"
* 'hinted_handoff_stop_sending_to_down_node-v2' of https://github.com/vladzcloudius/scylla:
hints_manager: rename the state::ep_state_is_not_normal enum value
hinted handoff: fix the logic that detects that the destination node is in DN state
hinted_handoff: sender::can_send(): optimize gossiper::is_alive(ep) check
hinted handoff: end_point_hints_manager::sender: use _gossiper instead of _shard_manager.local_gossiper()
types.cc: fix the compilation with fmt v5.3.0
Before ede1d248af, running "tools/toolchain/dbuild -it -- bash" was
a nice way to play in the toolchain environment, for example to start
a debugger. But that commit caused containers to run in detached mode,
which is incompatible with interactive mode.
To restore the old behavior, detect that the user wants interactive mode,
and run the container in non-detached mode instead. Add the --rm flag
so the container is removed after execution (as it was before ede1d248af).
Message-Id: <20190506175942.27361-1-avi@scylladb.com>
In 0ea6df, duration_test was made to link against all tests/*.o files.
This isn't necessary, as it only needs tests/exception_utils.o. This
patch narrows down duration_test's dependences to only
exception_utils.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190508211630.108228-1-dejan@scylladb.com>
Add new test cases:
- disallow creating a non-FULL index on frozen collections
- disallow repeated creation of a FULL index on frozen collections
- disallow FULL indexes on non-frozen collections
- disallow referencing frozen-map entries in the WHERE clause
Also add error-message expectations to existing test cases.
Fixes#3654.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190509025806.124499-1-dejan@scylladb.com>
Since other build_*.sh are for running inside extracted relocatable
package, they have SCYLLA-PRODUCT-FILE on top of the directory,
but build_ami.sh is not running in such condition, we need to run
SCYLLA-VERSION-GEN first, then refer to build/SCYLLA-PRODUCT-FILE.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190509110621.27468-1-syuu@scylladb.com>
Rename this state value to better reflect the reality:
state::ep_state_is_not_normal -> state::ep_state_left_the_ring
The manager gets to this state when the destination Node has left the ring.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
When node is in a DN state its gossiper state may be NORMAL, SHUTDOWN
or "" depending on the use case.
In addition to that if node has been removed from the ring its state is
also going to be removed from the gossiper_state map.
Let's consider the above when deciding if node is in the DN state.
Fixes#4461
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
AWS just released their new instances, the i3en instances. The instance
is verified already to work well with scylla, the only adjustments that
we need is advertise that we support it, and pre-fill the disk
information according to the performance numbers obtained by running the
instance.
Fixes#4486
Branches: 3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190508170831.6003-1-glauber@scylladb.com>
"
Cassandra has supported GROUP BY in SELECT statements since 2016
(v3.10), while ScyllaDB currently treats it as a syntax error. To
achieve parity with Cassandra in this important bit of functionality,
this patch adds full support for GROUP BY, from parsing to validation
to implementation to testing.
"
* 'groupby-implPP' of https://github.com/dekimir/scylla:
Implement grouping in selection processing
Propagate GROUP BY indices to result_set_builder
Process GROUP BY columns into select_statement
Parse GROUP BY clause, store column identifiers
Make result_set_builder obey its _group_by_cell_indices by recognizing
group boundaries and resetting the selectors.
Also make simple_selectors work correctly when grouping.
Fixes#2206.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Ensure that the indices recorded in select_statement are passed to
result_set_builder when one is created for processing the cell values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Extend the grammar file with GROUP BY, collect the column identifiers,
and store them in raw::select_statement.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Whenever the iotune_args array uses "--smp", it needs cpudata.smp()
which returns an integer instead of a string. So when iotune_args is
passed to subprocess.check_call(), it actually throws "TypeError:
expected str, bytes or os.PathLike object, not int" but
"%s did not pass validation tests, it may not be on XFS..." is shown as
the exception.
Even though the user inputs correct arguments, it might still throw an
error and confuse the user that he/she has not passed the right
arguments.
One simple fix is to use str(cpudata.smp()) instead of cpudata.smp().
Signed-off-by: JP-Reddy <guthijp.reddy@gmail.com>
Message-Id: <20190406070118.48477-1-guthijp.reddy@gmail.com>
"
gcc 9 complains a lot about pessimizing moves, narrowing conversions, and
has tighter deduction rules, plus other nice warnings. Fix problems found
by it, and make some non-problems compile without warnings.
"
* tag 'gcc9/v1' of https://github.com/avikivity/scylla:
types: fix pessimizing moves
thrift: fix pessimizing moves
tests: fix pessimizing moves
tests: cql_query_test: silence narrowing conversion warning
test: cql_auth_syntax_test: fix ambiguity due to parser uninitialized<T>
table: fix potentially wrong schema when reading from zero sstables
storage_proxy: fix pessimizing moves
memtable: fix pessimizing moves
IDL: silence narrowing conversion in bool serializer
compaction: fix pessimizing moves
cache: fix pessimizing moves
locator: fix pessimizing moves
database: fix pessimizing moves
cql: fix pessimizing moves
cql parser: fix conversion from uninitalized<T> to optional<T> with gcc 9
We use the schema during creation of the mutation_source rather than
during the query itself. Likely they're the same, and since no rows
are returned from a zero-sstable query, harmless. But gcc 9 complains.
Fix by using the query's schema.
bool serializers are now aliases to int8_t serializers, but gcc 9
complains about narrowing conversions, due to the path int8_t -> int -> bool.
A bad narrowing conversion here cannot happen in practice, but massage
the code a little to silence it.
We use uninitialized<T> (wrapping an optional<T>) to adjust to the
parser's way of laying out the code, but this fails with gcc 9
(presumably for the correct reasons) when converting from
uninitialized<T> back to optional<T>. Add a conversion operator
to make it build.
Many tests verify exception messages. Currently, they do so via
verbose lambdas or inner functions that hide test-failure locations.
This patch adds utilities for quick creation of message-checking tests
and replaces existing ad-hoc methods with these new utilities.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190506210006.124645-1-dejan@scylladb.com>
"
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
Branches: 3.1
"
* tag 'fix-gc_clock-digest-v1' of github.com:tgrabiec/scylla:
tests: Add test which verifies that schema digest stays the same
tests: Add sstables for the schema digest test
gc_clock: Fix hashing to be backwards-compatible
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
* seastar 4cdccae...f73690e (16):
> sstring: silence technically correct but unhelpful warning in sstring move ctor
> cmake: add a seastar_supports_flag function
> future: Fix build with libc++'s non-trivially-constructible std::tuple<>
> Revert "Make sure all allocations are properly bytes aligned"
> Merge "future: simplify future_state management" from Rafael
> Make sure all allocations are properly bytes aligned
> util/log: use correct clock type
> core/reactor: don't assume system_clock::duration is in nanoseconds
> Merge "Optimize the future_state move constructor" from Rafael
> rpc: don't use boost/variant.hpp directly
> core/memory: Omit [[gnu::leaf]] attribute on clang
> Fix build with std::filesystem
> Merge "Fix clang build and tests" from Rafael
> cmake: Move ) out of quotes
> Merge "Fix some bugs found by (or perhaps in) gcc 9" by Avi
> Deduplicate Seastar dependencies management in CMake scripts
scylla-housekeeping always wants to run in the installation to check if
we are running the latest version. This happens regardless of whether or
not we said yes or no to the housekeeping scylla_setup question - as
that question only deals with whether or not we want to do this through
a timer.
It is fine to try to run scylla-housekeeping, as long as we time it out.
The current code doesn't.
The naive solution is to add a timeout parameter to urllib.request.open.
However, that timeout is not respected and in my tests I saw real
timeouts up to four times higher the timeout we set. For a reasonable 5s
timeout, this mean a 20s real timeout which can lead to a very bad user
experience. This seems to be a known problem with this module according
to a quick Google search.
This patch then takes a slightly more complex solution and uses
multiprocess to enforce a well-defined user-visible timeout.
Fixes#3980
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190506122335.5707-1-glauber@scylladb.com>
batchlog make copies of topology and endpoint array in batchlog endpoint
choosing code. There is a remark that at least endpoint copy is
deliberate because Cassandra code has it. We do not have to follow. Our
endpoint calculation code is atomic, so we can use a reference.
Message-Id: <20190506115815.GK21208@scylladb.com>
After incremental compaction, new sstables may have already replaced old
sstables at any point. Meaning that a new sstable is in-use by table and
a old sstable is already deleted when compaction itself is UNFINISHED.
Therefore, we should *NEVER* delete a new sstable unconditionally for an
interrupted compaction, or data loss could happen.
To fix it, we'll only delete new sstables that didn't replace anything
in the table, meaning they are unused.
Found the problem while auditting the code.
Fixes#4479.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190506134723.16639-1-raphaelsc@scylladb.com>
Mutations carry their schema, so use that instead of bring in a global schema,
which may change as features are added.
Message-Id: <20190505132542.6472-1-avi@scylladb.com>
perf_fast_forward is used to detect performance regressions. The two
main metrics used for this are fargments per second and the number of
the IO operations. The former is a median of a several runs, but the
latter is just the actual number of asynchronous IO operations performed
in the run that happened to be picked as a median frag/s-wise. There's
no always a direct correlation between frag/s and aio and the latter can
vary which makes the latter hard to compare.
In order to make this easier a new metric was introduced: "average aio"
which reports the average number of asynchronous IO operations performed
in a run. This should produce much more stable results and therefore
make the comparison more meaningful.
Message-Id: <20190430134401.19238-1-pdziepak@scylladb.com>
std::regex_match of the leading path may run out of stack
with long paths in debug build.
Using rfind instead to lookup the last '/' in in pathname
and skip it if found.
Fixes#4464
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190505144133.4333-1-bhalevy@scylladb.com>
The setup script asks the user whether or not housekeeping should
be called, and in the first time the script is executed this decision
is respected.
However if the script is invoked again, that decision is not respected.
This is because the check has the form:
if (housekeeping_cfg_file_exists) {
version_check = ask_user();
}
if (version_check) { do_version_check() } else { dont_do_it() }
When it should have the form:
if (housekeeping_cfg_file_exists) {
version_check = ask_user();
if (version_check) { do_version_check() } else { dont_do_it() }
}
(Thanks python)
This is problematic in systems that are not connected to the internet, since
housekeeping will fail to run and crash the setup script.
Fixes#4462
Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502034211.18435-1-glauber@scylladb.com>
Newer versions of RHEL ship the os-release file with newlines in the
end, which our script was not prepared to handle. As such, scylla_setup
would fail.
This patch makes our OS detection robust against that.
Fixes#4473
Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502152224.31307-1-glauber@scylladb.com>
"
One of the fixes is for incorrect recognition of memory pages as belonging
or not belonging to small allocation pools in some cases.
Also, compensates for https://github.com/scylladb/seastar/issues/608 in "scylla memory",
which improves accurracy of the small allocation pool report.
Fixes "scylla task_histogram" to not look into pages which do not belong to live
small allocation pool spans.
Fixes#4367Fixes#4368
"
* tag 'gdb-fix-span-qualification-v2' of github.com:tgrabiec/scylla:
gdb: Print size of large allocations in 'scylla ptr'
gdb: Fix 'scylla ptr' for free pages
gdb: Set is_live and offset for large allocations properly in 'scylla ptr'
gdb: Fix 'scylla ptr' misqualifying pointers
gdb: Make 'scylla memory' show unused memory in small pools
gdb: Fix small pool memory usage reporting in 'scylla memory'
gdb: Switch 'scylla memory' to use the span_checker to find large spans
gdb: Switch task_histogram to use the span_checker
gdb: Introduce span_checker
endpoint_filter() function assumes that each bucket of
std::unordered_multimap contains elements with the same key only, so
its size can be used to know how many elements with a particular key
are there. But this is not the case, elements with multiple keys may
share a bucket. Fix it by counting keys in other way.
Fixes#3229
Message-Id: <20190501133127.GE21208@scylladb.com>
The existing unit test test_secondary_index_contains_virtual_columns
reproduced a bug (issue #4144) with indexing of primary-key columns,
but we only actually tested clustering columns. In issue #4471 there
was a question whether we may still have a bug when indexing of
*partition-key* columns. This patch adds a test that verifies that
we don't, and this case works well too.
Refs #4144
Refs #4471
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190501113500.25900-1-nyh@scylladb.com>
gossiper::is_alive() has a lot of not needed checks (e.g. is_me(ep)) that
are irrelevant for HH use case and we may safely skip them.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Compilation fails with fmt release 5.3.0 when we print a bytes_view
using "{}" formatter.
Compiler's complain is: "error: static assertion failed: mismatch
between char-types of context and argument"
Fix this by explicitly using to_hex() converter.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
While causing some duplication (names are explicitly instead of implicitly
stringified, and names are repeated in the member declaration and initializer),
it is overall more maintainable than the huge macro. It is easier to overload
named_value constructors when you can get error reporting on the line where the error
occurs, for example.
Now that named_value::value_as_json() exists, make use of it to report the
current value of a configuration variable via the REST API, instead of
_make_config_values().
Currently, the REST API does its own conversion of named_value into json.
This requires it to use the _make_config_values macro to perform iteration
of all config items, since it needs to preserve the concrete type of the item
while iterating, so it can select the correct json conversion.
Since we want to remove that macro, we need to provide a different way to
convert a config item to json. So this patch adds a value_as_json().
To hide json_return_value from the rest of the system, we extend config_type
with a conversion function to handle the details. This usually calls
the json_return_type constructor directly, but when it doesn't have default
translation, it interposes a conversion into a type that json recognizes.
I didn't bother maintaining the existing type names, since they're C++
names which don't make sense for the UI.
The value_status is only needed at run-time, and removing it from the
template parameter list reduces type proliferation (which leads to code
bloat) and simplifies the code.
I want to remove the _make_config_values macro, but it is needed now in
api/config.cc to make the type names available. So as a first step, copy the
type names to config_src. Further changes can extract it from there.
Because we want to add more type infomation in following patches, place the type
name in a new config_type object, instead of allocating a string_view in
config_src.
Fixes runtime error which happens because the setter is expected to
take an argument, but our definition doesn't take one. We're not
really expecting the setter to be called with False, so don't use setter
semantics.
It can be that page::pool is != nullptr and page::offset_in_span is 0
for a page which is inside a large allocation span (live or
dead). This may lead to misqualification of a pointer as belonging to
a small allocation pool.
Only the first page of a span contains reliable information. This
patch changes the code to use the span_checker, which knows the real
boundaries of spans and exposes reliable information via the span
object.
Fixes#4368
It can be that page::pool is != nullptr and page::offset_in_span is 0
for a page which is inside a large allocation span (live or
dead). This may lead to misqualification of that span as belonging to
a small allocation pool and interpreting its contents as if it
contained small objects.
Only the first page of a span contains reliable information. This
patch changes the code to use the span_checker, which knows the real
boundaries of spans and exposes reliable information via the span
object.
Another problem was that the command scanned dead spans as well.
This is no longer the case after this patch.
I've seen this command report thousands of no longer live sstable
writers and various continuations because of those problems.
Fixes#4367
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
@@ -22,6 +22,15 @@ Scylla depends on the system package manager for its development dependencies.
Running `./install-dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
On Ubuntu and Debian based Linux distributions, some packages
required to build Scylla are missing in the official upstream:
- libthrift-dev and libthrift
- antlr3-c++-dev
Try running ```sudo ./scripts/scylla_current_repo``` to add Scylla upstream,
and get the missing packages from it.
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native
@@ -47,7 +56,7 @@ $ ./configure.py --help
The most important option is:
-`--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
- `--enable-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
Source files and build targets are tracked manually in `configure.py`, so the script needs to be updated when new files or targets are added or removed.
@@ -55,6 +64,7 @@ To save time -- for instance, to avoid compiling all unit tests -- you can also
seastar::metrics::description("number of operations via Alternator API"),{op(CamelCaseName)}),
#define OPERATION_LATENCY(name, CamelCaseName) \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"),{op(CamelCaseName)},[this]{returnapi_operations.name.get_histogram(1,20);}),
"summary":"Return the generation value for this node.",
"type":"int",
"type":"long",
"nickname":"get_current_generation_number",
"produces":[
"application/json"
@@ -646,7 +646,7 @@
{
"method":"POST",
"summary":"Trigger a cleanup of keys on a single keyspace",
"type":"int",
"type":"long",
"nickname":"force_keyspace_cleanup",
"produces":[
"application/json"
@@ -678,7 +678,7 @@
{
"method":"GET",
"summary":"Scrub (deserialize + reserialize at the latest version, skipping bad rows if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false",
"type":"int",
"type":"long",
"nickname":"scrub",
"produces":[
"application/json"
@@ -726,7 +726,7 @@
{
"method":"GET",
"summary":"Rewrite all sstables to the latest version. Unlike scrub, it doesn't skip bad rows and do not snapshot sstables first.",
"type":"int",
"type":"long",
"nickname":"upgrade_sstables",
"produces":[
"application/json"
@@ -800,7 +800,7 @@
"summary":"Return an array with the ids of the currently active repairs",
"type":"array",
"items":{
"type":"int"
"type":"long"
},
"nickname":"get_active_repair_async",
"produces":[
@@ -816,7 +816,7 @@
{
"method":"POST",
"summary":"Invoke repair asynchronously. You can track repair progress by using the get supplying id",
"type":"int",
"type":"long",
"nickname":"repair_async",
"produces":[
"application/json"
@@ -947,7 +947,7 @@
"description":"The repair ID to check for status",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.